Preface
"My friend Milla Jovovich and I spent months creating an AI memory system with Claude. It just posted a perfect score on the standard benchmark --- beating every product in the space, free or paid.
It's called MemPalace, and it works nothing like anything else out there."
--- Ben Sigman (@bensig)
In early 2026, this tweet sent a modest shockwave through the tech community. The shock was not that yet another AI memory product had appeared --- the market has never lacked those --- but rather two unusual facts: first, it achieved the first-ever perfect score on the LongMemEval benchmark (500/500, R@5 = 100%); second, the pairing of its two founders was genuinely unexpected.
Ben Sigman holds a Classics degree from UCLA, has over twenty years of systems engineering experience, is CEO of Bitcoin Libre, spent years as a tech entrepreneur deep in decentralized lending markets, and wrote Bitcoin One Million. Milla Jovovich --- yes, that Milla Jovovich, the Hollywood actress with five million Instagram followers --- has a GitHub bio that reads "architect of the MemPalace."
Milla Jovovich — from The Fifth Element to the architect of MemPalace.
A classically trained systems engineer, a Hollywood actress, and a system they spent months building with Claude --- one that beat every commercial product and academic system on the benchmark.
This fact alone warrants serious examination.
Why This Book Was Written
MemPalace accumulated over two thousand GitHub stars in a short period, fully open-sourced under the MIT license. Brian Roemmele --- founder of The Zero-Human Company --- said after testing: "We have been testing MemPalace... absolutely blown away! It is a freaking masterpiece and we have deployed it to 79 employees." Wayne Sutton put it more bluntly: "Milla Jovovich launching an AI memory system with Claude was not on my 2026 list." LLMJunky summarized: "She's co-developed the highest-scoring AI memory system ever benchmarked. Totally free and OSS. What a boss."
Yet the discussion around MemPalace has largely remained on two levels: surprise at the founders' identities, and amplification of the benchmark scores. Very few have analyzed in depth why its design works, what choices it made that fundamentally differ from mainstream AI memory systems, and the engineering trade-offs behind those choices.
This book attempts to fill that gap.
This is not a tutorial --- it is a design analysis. You will not find "step one: install, step two: configure" instructions here --- MemPalace's README and documentation already do that well enough. This book is concerned with deeper questions: Why can ancient Greek memory techniques be effective again in the era of large language models? Why can a zero-API-call local system achieve 96.6% retrieval precision, reaching 100% with a single lightweight reranking step? Why can a compression dialect designed for AI aim for extremely high compression while still trying to preserve factual structure? Why does abandoning "let the AI decide what's worth remembering" actually produce better results?
Behind every design decision lies a concrete engineering problem. This book's job is to make the relationship between those problems and decisions clear.
What This Book Is Not
Several things need to be stated upfront.
First, this is not MemPalace's official documentation. This book is an independent, third-party technical analysis based on publicly available source code, benchmark data, and design documents. The analysis represents the author's understanding, not the project founders' intentions.
Second, this is not a survey of AI memory systems. While we will compare with other approaches where necessary to illustrate the uniqueness of design choices, the book's focus remains on MemPalace's own architectural logic.
Third, this is not a hands-on guide for building a similar system from scratch. The intended audience is technical practitioners and researchers who already have some AI engineering experience and are interested in memory system design. If you are designing your own AI memory solution, this book can help you understand an approach that has been proven effective; but it will not walk you through writing code.
What MemPalace Does Differently
Before entering the main text, it is worth sketching MemPalace's core design choices in a few paragraphs so readers can build an initial mental model.
Mainstream AI memory systems follow a common paradigm: have the model extract "important information" during conversations, store it in a vector database, and retrieve via semantic similarity matching. The problem with this paradigm is that it introduces irreversible information loss at the storage stage --- the model extracts "user prefers PostgreSQL" but loses all the context from that two-hour conversation where you explained why you migrated away from MongoDB.
MemPalace's core stance is: Store everything, then let structure make it retrievable.
This stance gave rise to three key designs:
The Memory Palace Structure. Borrowing from the ancient Greek orators' memory technique --- remembering an entire speech by placing ideas in different rooms of an imaginary building --- MemPalace organizes your memories as Wings (people and projects), Halls (memory types), and Rooms (specific concepts). This spatial metaphor is not decoration; it is a real retrieval acceleration mechanism: structural organization alone produced a 34% retrieval precision improvement.
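To make the three-tier addressing concrete, here is a minimal sketch of the idea. The class names mirror the Wing-Hall-Room terminology above, but the fields and methods are illustrative assumptions, not MemPalace's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the Wing -> Hall -> Room hierarchy.
# Names and fields are illustrative, not MemPalace's actual schema.

@dataclass
class Room:
    name: str                        # a specific concept, e.g. "auth"
    memories: list[str] = field(default_factory=list)

@dataclass
class Hall:
    name: str                        # a memory type, e.g. "decisions"
    rooms: dict[str, Room] = field(default_factory=dict)

@dataclass
class Wing:
    name: str                        # a person or project
    halls: dict[str, Hall] = field(default_factory=dict)

class Palace:
    def __init__(self):
        self.wings: dict[str, Wing] = {}

    def place(self, wing: str, hall: str, room: str, memory: str) -> str:
        """File a memory at a spatial address instead of into a flat heap."""
        w = self.wings.setdefault(wing, Wing(wing))
        h = w.halls.setdefault(hall, Hall(hall))
        r = h.rooms.setdefault(room, Room(room))
        r.memories.append(memory)
        return f"{wing}/{hall}/{room}"    # the address itself is an index

palace = Palace()
addr = palace.place("project-atlas", "decisions", "auth", "Chose Clerk over Auth0")
# addr == "project-atlas/decisions/auth"
```

The point of the sketch: every memory gets a path at write time, so retrieval can navigate rather than scan.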
The AAAK Compression Dialect. This is a shorthand language designed specifically for AI agents. It is not meant for humans to read --- it is meant for your AI to read, and it reads fast. For structured information such as teams, projects, and decisions, format compression can often reach 5-10x while preserving factual assertions; for long, redundant conversation logs, the README's 30x claim comes from combining structured expression with content selection. That is powerful, but not identical to a blanket "zero-loss" guarantee for the current plain-text compressor. Because AAAK is essentially structured text with universal grammar, it works with any model that can read text --- Claude, GPT, Gemini, Llama, Mistral --- no decoder needed, no fine-tuning, no cloud API.
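AAAK's real grammar lives in the MemPalace repository; the following is only a hypothetical shorthand that demonstrates the kind of format compression described above, with terse keys, no filler prose, and one fact per line. Every key name and separator here is an invented stand-in:

```python
# Illustrative only: this is NOT AAAK's actual syntax, just a demonstration
# of compressing a structured fact into a dense, still-legible line that
# any text-reading model can consume without a decoder.

verbose = {
    "project": "atlas",
    "decision": "Adopt Clerk for authentication",
    "rejected": ["Auth0 (pricing above 10k users)", "in-house (3 weeks dev)"],
    "date": "2026-01-12",
}

def to_shorthand(fact: dict) -> str:
    """Emit one dense line; factual assertions survive, filler does not."""
    parts = [f"prj:{fact['project']}",
             f"dec:{fact['decision']}",
             "rej:" + "|".join(fact["rejected"]),
             f"@{fact['date']}"]
    return ";".join(parts)

line = to_shorthand(verbose)
# line == "prj:atlas;dec:Adopt Clerk for authentication;"
#         "rej:Auth0 (pricing above 10k users)|in-house (3 weeks dev);@2026-01-12"
```

Note what the sketch preserves: the decision, the rejected alternatives, and the reasons for rejection all remain recoverable, which is exactly the "preserve factual assertions" property the format-compression claim depends on.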
The Four-Layer Memory Stack. From temporary working memory to long-term persistence, MemPalace simulates the memory hierarchy from cognitive science. Different layers have different lifespans, different compression strategies, and different retrieval paths. This is not a flat key-value store --- it is a time-aware knowledge graph.
These three designs are interwoven, collectively explaining the seemingly incredible benchmark numbers: LongMemEval R@5 perfect score, 96.6% with zero API calls, ConvoMem 92.9%, LoCoMo 100% (with reranking; baseline 60.3%, see Chapter 23 for the honest analysis).
Recommended Reading Paths
This book is divided into nine parts and twenty-five chapters. Based on your background and interests, here are several different reading paths.
Path One: Systems Architect
If you are an engineer currently designing AI memory systems or knowledge management systems, the recommended reading order is:
- Part 1 (Chapters 1--3): Problem Space --- Understand the core problem MemPalace aims to solve, and why existing approaches fall short. This is the motivation for all subsequent design decisions.
- Part 2 (Chapters 4--7): Memory Palace Structure --- How the spatial metaphor translates into engineering structure, and how the Wing-Hall-Room three-tier system achieves retrieval acceleration.
- Part 4 (Chapters 11--13): Temporal Knowledge Graph --- How MemPalace encodes time dimensions in graph structures, making "the discussion about X from last year" a computable query.
- Part 5 (Chapters 14--15): Four-Layer Memory Stack --- The layered strategy from working memory to long-term storage.
- Part 8 (Chapters 22--23): Validation --- Benchmark design and results analysis.
This path covers the system's skeleton, giving you an understanding of the overall architecture, after which you can trace back to other chapters for details as needed.
Path Two: AI Application Developer
If you are more interested in how to integrate similar memory capabilities into your own AI applications, start with these chapters:
- Part 1 (Chapters 1--3): Problem Space --- Start with the problem, as before.
- Part 3 (Chapters 8--10): AAAK Compression Language --- Understand how this AI-oriented compression dialect was designed, where its current heuristic implementation stops, and why its text-first format works on any model. This may be MemPalace's most original contribution.
- Part 6 (Chapters 16--18): Data Ingestion Pipeline --- The complete flow from raw conversation data to structured memory.
- Part 7 (Chapters 19--21): Interface Design --- MCP toolset, command-line interface, and local model integration.
- Part 9 (Chapters 24--25): Design Philosophy and the Future --- Broader implications of MemPalace's design philosophy for AI application development.
This path emphasizes transferable design patterns and integration approaches.
Path Three: Quick Overview
If your time is limited and you just want to understand why MemPalace works, you can read only Chapter 1 (problem definition), Chapter 4 (palace structure overview), Chapter 8 (AAAK core principles), and Chapter 22 (benchmark results). Four chapters, roughly two hours, sufficient for a complete high-level understanding.
Of course, you can also read straight through from beginning to end. The book's structure is arranged in logical progression: from problem to solution, from solution to implementation, from implementation to validation, from validation to reflection.
A Footnote on Background
Ben Sigman's Classics degree is not an irrelevant biographical detail. MemPalace's core metaphor --- the memory palace, also known as the Method of Loci --- is a central technique of the ancient Greek and Roman rhetorical tradition. Cicero described this method in detail in De Oratore: an orator walks through an imaginary building, recalling one argument at each location passed. Over two thousand years later, the same spatial metaphor has proven equally effective for large language model memory retrieval.
This is not a coincidence. Spatial structure works in both human memory and AI memory because it provides an organizational dimension orthogonal to the content itself. When you no longer need to remember "where the information is" (because structure already tells you), you can devote all cognitive resources to understanding the information itself. This principle does not depend on whether the memory substrate is a brain or a language model.
A systems engineer who studied Classics recognized this. A Hollywood actress --- who is also a serious technical contributor --- helped turn that insight into working code. Then Claude helped them build it.
This combination looks impossible, but the results speak for themselves: a perfect score.
Acknowledgments and Disclosures
This book is based on analysis of MemPalace's public source code (github.com/milla-jovovich/mempalace, MIT license), official documentation, and public benchmark data. All cited tweets and comments come from publicly posted social media content.
Thanks to Ben Sigman and Milla Jovovich for open-sourcing this system, making this kind of deep analysis possible. Thanks to the contributors in the MemPalace community who have shared their usage experiences and test data.
Let us begin.
Chapter 0: Building Things with Claude
At some point in 2025, a Hollywood actress and the CEO of a Bitcoin company started writing software with an AI.
That sentence sounds like the opening of a Silicon Valley parable, but it actually happened. The project was called MemPalace, an AI memory system. A few months later, it posted benchmark results that were unusually strong by the standards of public AI memory systems. And the most interesting part of the entire story is not the final result, but how the thing was built.
Two Unlikely Partners
Ben Sigman's career follows an uncommon trajectory. He studied Classics at UCLA --- ancient Greek, Latin, Cicero, and rhetoric. Then he spent twenty years in systems engineering. Then he founded Bitcoin Libre and became CEO.
These three chapters of his career seem entirely unrelated, yet they converge in MemPalace. Classics gave him a key concept: the Method of Loci, also known as the "memory palace" technique. This was a memory technology used by ancient Greek and Roman orators --- placing information to be remembered in different rooms of an imaginary building, then mentally walking through that building to extract information room by room when recall was needed. Cicero used this method to memorize long speeches. Medieval monks used it to memorize entire volumes of scripture. The method works because the human brain is inherently skilled at spatial memory --- our ability to remember "where things are" far exceeds our ability to remember "what things are."
Twenty years of systems engineering gave him a different kind of intuition: how to turn an elegant concept into runnable software. Not the "proof of concept" kind found in academic papers, but something that actually runs in production. He knew what complexity was acceptable, what dependencies should be avoided, and what architecture would still be maintainable three years later.
Milla Jovovich is another name unlikely to appear in this story. She is better known as Leeloo in The Fifth Element and Alice in the Resident Evil series. But her GitHub bio reads "architect of the MemPalace." The project is hosted under her GitHub account.
This is not a celebrity lending their name. From the project's commit history and version iterations, MemPalace went through multiple major refactors before stabilizing at v3.0.0. That depth of iteration does not come from lending a name. It means repeated discussion, reversal, and rebuilding.
The Third Collaborator
Then there is Claude.
When Ben discussed this project on social media, his choice of words was worth noting. He said "spent months creating an AI memory system with Claude" --- not "written using Claude" (Claude as tool), not "had Claude write it" (Claude as executor), but "created with Claude" --- Claude as collaborator.
This subtle distinction in wording points to a new mode of work that, as of 2025, has not yet been well named.
Over the past two years, human-AI code collaboration has coalesced into two mainstream patterns. The first is "AI generates, human reviews" --- the human describes requirements, the AI generates code, the human inspects, modifies, and merges. This is the typical GitHub Copilot workflow. The second is "human leads, AI assists" --- the human writes the core logic, asks the AI when uncertain, and the AI offers suggestions or code snippets.
The MemPalace development process appears to belong to neither.
The codebase's structure offers some clues. The project is written in Python, containing roughly 30 modules, each with a single responsibility and clear boundaries. This structure itself tells a story: someone was making holistic architectural decisions --- which features should be independent modules, how modules should communicate, what should be exposed as public interfaces. These decisions require a global understanding of the entire system and a personal judgment about "what software should look like." This is not a result that line-by-line code generation can produce.
At the same time, the project's dependency list is remarkably short: chromadb and pyyaml, nothing else. A system involving vector search, semantic retrieval, knowledge graphs, data compression, and multi-format parsing uses only two external dependencies. This indicates that a large amount of functionality was implemented from scratch rather than assembled from third-party libraries. This inclination toward "build it yourself if you can" typically comes from an experienced engineer's deep understanding of dependency management --- every additional dependency is one more thing that might wake you at 3 AM someday.
Yet simultaneously, implementing 30 modules is a substantial workload for two people, especially within a "few months" timeframe. A reasonable inference is that Claude handled a large portion of the implementation work, while the human collaborators were responsible for architectural decisions, domain knowledge injection, and quality control.
The "domain knowledge" here is not programming knowledge in the general sense. What Ben brought was the wisdom of ancient Greeks from two thousand years ago about memory, and the intuition accumulated over twenty years of systems engineering about "what can survive in production." These things do not exist in any AI's training data --- at least not in a form that can be directly applied. They require a person to translate classical concepts into the language of software architecture, then let the AI implement them.
Traces of Iteration
The version number 3.0.0 is itself a story.
A software project reaching 3.0 means it has undergone at least two major refactors. Not minor patch-level upgrades, but "this direction isn't working, tear it down and start over" level changes. Each major version iteration typically means the developers' understanding of the problem underwent a fundamental shift --- not "this function should be written differently," but "we have been solving the wrong problem."
From the project's git history, the development process was iterative. The commit log shows a gradual evolution, not a finished product that suddenly appeared from nothing one day. This is consistent with the statement about "working with an AI for several months." It was not a weekend hackathon project, nor a product "generated in one shot" by an AI. It was the result of repeated experimentation and correction.
One can imagine the outline of this process (though specific details cannot be confirmed from public information): early versions may have tried a simpler memory approach and found the results inadequate; then the spatial structure concept of the "palace" was introduced, and retrieval accuracy improved significantly; later, performance issues may have arisen when handling large-scale data, leading to the development of the AAAK compression dialect to address context window limitations. Each step was not planned in advance but emerged from discovering, understanding, and solving problems in practice.
During this iterative process, the human-AI collaboration most likely was not static. In the early exploratory phase, human intuition and judgment probably dominated --- an insight like "we should organize data using the memory palace approach" would not come from an AI. In the middle implementation phase, the AI's code generation capability was likely fully utilized --- turning concepts into runnable modules. In the late optimization phase, it probably returned to intensive human-AI dialogue --- "why won't this benchmark score go up? Is it the retrieval logic or the data organization?"
Why This Matters
In 2025, "writing code with AI" is no longer news. Every day, thousands of developers use Copilot, Cursor, and Claude to accelerate their programming work. Most of the time, these tools are used as smarter auto-complete --- the human writes one line of comment, and the AI fills in five lines of code.
The MemPalace case is interesting because it hints at a different possibility.
When a person with a Classics background, a person with a Hollywood background, and an AI sit down together for months and produce a work that beats all existing systems on academic benchmarks --- the significance is not "AI is amazing" or "these two people are amazing," but in the combination itself.
If Ben had not been trained in Classics, he would not have thought to use a two-thousand-year-old memory technique to organize AI data. The Method of Loci is not in any standard "AI memory system" technology stack. It came from an entirely different knowledge domain, brought into this project by someone who happened to understand both domains.
Without twenty years of systems engineering experience, the "two dependencies" decision would not have happened. A less experienced team facing the same requirements would likely have pulled in a dozen libraries to "ship quickly," then been buried by dependency hell six months later. A minimal dependency strategy is not conservatism; it is judgment born from experience.
Without Claude's participation, completing the development, testing, and iteration to v3.0 of 30 modules within a few months would have been extremely difficult for two people. The AI here is not a nice-to-have assistive tool; it is the key factor that made this project possible within the given time and staffing constraints.
All three are indispensable. Classics provided the core insight, engineering experience provided architectural judgment, and AI provided implementation bandwidth. This is not a story about "AI replacing programmers" --- quite the opposite. It is a story about "human cross-domain knowledge becoming more valuable in the AI era." Because AI can write code, but it will not on its own go read Cicero's De Oratore and then have a flash of inspiration: "hey, a two-thousand-year-old memory technique can solve the 2025 AI context management problem."
That kind of connection --- spanning time, spanning disciplines --- remains a uniquely human capability. And the role of AI is to ensure that once these connections are made, they can be turned into working systems at unprecedented speed.
So What Did They Build?
Months of collaboration produced a memory system. In its raw configuration --- no external API calls, no cloud service dependencies, and a largely local runtime once assets are prepared --- it scored 96.6% on the LongMemEval benchmark. With a lightweight reranking step, it reached 100% on the full 500-question set; elsewhere in the book we also keep the cleaner held-out 450-question score separate rather than collapsing everything into one number.
Among all publicly available AI memory systems, free or paid, there is no higher score.
The result itself is impressive. But the more important question is: how did they do it? How did a system that depends on only two Python packages, runs entirely locally, and requires no API keys surpass commercial products backed by ample engineering resources and cloud computing budgets?
The answer lies in a concept from two thousand years ago.
The core principle of the memory palace technique is not "remember more" but "make information findable." Ancient Greek orators did not memorize long speeches by rote --- they placed each argument at a specific location in an imaginary building, and when needed, they walked along the route and naturally extracted each point in order. The key insight is that spatial structure itself is an index.
MemPalace applies the same principle to AI memory. Every conversation with an AI, every decision, every debugging session --- this information is not dumped into one giant unstructured text heap but placed in a "palace" with Wings, Halls, and Rooms. When you ask "why did we abandon GraphQL three months ago," the system does not need to scan all memories --- it knows which Wing and which Room to look in.
This single structural improvement alone raised retrieval accuracy by 34%.
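The pruning effect is easy to sketch. The palace layout, paths, and matching function below are illustrative assumptions, not MemPalace's retrieval code; the point is that structural scoping eliminates most candidates before any similarity scoring happens:

```python
# Why structure acts as an index: scope the candidate set to one branch of
# the palace before scoring, instead of scanning every memory.
# Layout and matching are illustrative assumptions.

palace = {
    "project-atlas/decisions/api":  ["dropped GraphQL: N+1 resolver cost"],
    "project-atlas/decisions/auth": ["chose Clerk over Auth0"],
    "project-orion/debugging/db":   ["pool max_idle_time too short"],
}

def retrieve(query_path: str, query_terms: set[str]) -> list[str]:
    """Walk only the rooms under query_path; never touch other wings."""
    hits = []
    for path, memories in palace.items():
        if not path.startswith(query_path):
            continue                  # structural pruning: skip entire rooms
        for m in memories:
            if query_terms & set(m.lower().split()):
                hits.append(m)
    return hits

# "why did we abandon GraphQL?" -> routed to the decisions hall first
hits = retrieve("project-atlas/decisions", {"graphql", "dropped"})
# hits == ["dropped GraphQL: N+1 resolver cost"]
```

In a real system the final matching step would be vector similarity rather than word overlap, but the structural pruning step is the same: the path answers "where" so the scorer only has to answer "which."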
But this is only the beginning of the story. How is the palace structure defined? By what rules are Rooms divided? What happens when the same topic appears in multiple different contexts? How do you fit months of memory into a limited context window? How do you detect when memories contradict each other?
The answers to these questions form the entire content of the rest of this book. Starting from the next chapter, we will disassemble every component of this palace --- not as user documentation for an open-source project, but as a complete record of design decisions: what problem was faced, what alternatives were considered, why the final approach was chosen, and the engineering trade-offs behind those choices.
This palace is worth walking into and examining closely.
Chapter 1: Conversation as Decision
Positioning: This chapter reveals a paradigm shift that is already underway but not yet fully recognized --- the primary arena for technical decisions has migrated from traditional tools to AI conversations, and these decision records are systematically evaporating.
Monday Morning, 9:13 AM
Lin Yuan opens his terminal and launches Claude. He needs to choose an authentication solution for his team's SaaS analytics platform.
This is not a simple technology selection. Auth0's pricing model becomes expensive once user counts exceed ten thousand; Clerk offers a better developer experience but has a smaller community ecosystem; building in-house means at least three weeks of development. Lin Yuan pours all three options, the team's tech stack constraints, budget ceiling, and the pitfalls they hit with Auth0 last quarter into the conversation window.
Forty minutes later, the decision is made. Clerk. The reasons: more transparent pricing model, cleaner SDK integration with Next.js, and more complete webhook support. Lin Yuan sends a one-liner in Slack --- "We're going with Clerk, rationale to be documented in Confluence later" --- then moves on to his next task.
That Confluence document was never written.
Not because Lin Yuan is lazy. But because that forty-minute conversation was itself the decision process --- all the weighing, elimination, and validation had already happened. "Writing it up" into documentation is essentially asking someone to re-serialize thinking that has already been completed, in a less efficient format. The return on investment is too low, so it perpetually sits at the bottom of the priority list.
By Friday, Lin Yuan no longer remembers why Auth0 was eliminated. By next quarter, when a new engineer asks "why not Auth0," Lin Yuan can only say "pricing was an issue" --- but exactly which pricing tier, at what user scale, and compared against what alternatives, all of that is lost.
This is not Lin Yuan's personal problem. This is a systemic amnesia that every technical team deeply using AI is experiencing from 2024 to 2026.
Decision Migration: An Unnamed Paradigm Shift
Over the past thirty years of software development, the medium for technical decisions has undergone a clear evolution:
1990--2010: The Document-Driven Era. Decisions were recorded in design documents, RFCs, and mailing lists. These media were persistent, searchable, and version-controlled. An architecture decision record (ADR) written in 1998 could still be retrieved in 2008. The information decay rate was near zero.
2010--2020: The Ticket-Driven Era. Decisions scattered across Jira, Confluence, Notion, and GitHub Issues. Information became fragmented, but at least there was a promise of "traceability" --- in theory, you could find any historical decision through search. In practice, the average lifespan of a Confluence page was about 18 months before it sank into the depths of unmaintained page hierarchies. The information decay rate was low, but retrieval cost kept rising.
2020--2024: The Prelude to the Conversation-Driven Era. Slack and Teams began carrying more and more real-time decisions. But these tools at least had search functions, message history, and channel archiving. Information retention was passive --- you did not need to actively save; the platform did it for you.
2024--2026: The AI Conversation-Driven Era. This is the breaking point. When developers begin using Claude, ChatGPT, and Copilot for technical decisions, the decision medium becomes the session window --- a temporary container with a lifespan measured in hours. Session ends, container destroyed, and the entire context of the decision evaporates with it.
This shift is dangerous not because it is happening slowly, but because it is accelerating while almost no one realizes what is being lost.
Traditional knowledge management frameworks --- the DIKW pyramid (Data, Information, Knowledge, Wisdom) --- assume that knowledge is refined layer by layer from data. But in AI conversations, the path of knowledge creation is entirely different: it emerges through the interactive process of dialogue. The developer inputs constraints, the AI provides options, the developer probes edge cases, the AI revises its suggestions, the developer makes a judgment --- this ping-pong interaction is itself the knowledge generation process. The final decision is just the tip of the iceberg; beneath the surface lies the entire reasoning chain.
And existing knowledge management systems --- from Confluence to Notion to Linear --- capture only the tip of the iceberg.
19.5 Million Tokens Evaporating
Let us make a conservative estimate.
A moderate-intensity AI user --- not someone spending eight hours a day in AI conversations, just a normal senior engineer --- spends roughly 3 hours per day interacting with AI. This number seems large, but breaks down into everyday activities: morning AI-assisted review of yesterday's PR (30 minutes), mid-morning architectural design for a new feature with AI assistance (45 minutes), afternoon debugging a bizarre race condition with AI (60 minutes), evening exploring the feasibility of a new framework with AI (45 minutes).
Token consumption per hour depends on conversation density. Pure text discussion runs about 5,000 tokens/hour, but when conversations include code snippets, error stacks, and configuration files, the rate climbs past 10,000 tokens/hour, and code-heavy sessions can exceed 35,000 tokens/hour.
Now do the multiplication:
180 days × 3 hours/day × 10,000--36,000 tokens/hour ≈ 5,400,000--19,500,000 tokens
Taking the upper bound --- because conversations containing code dominate real-world scenarios --- 6 months produces approximately 19.5 million tokens of decision records.
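The back-of-envelope estimate is easy to reproduce. Note what the headline figure implies: reaching 19.5 million tokens over 540 hours requires code-heavy sessions averaging roughly 36,000 tokens/hour, well above the text-only rate:

```python
# Reproducing the six-month token-evaporation estimate.
days = 180                            # six months
hours_per_day = 3
tok_low, tok_high = 10_000, 36_000    # mixed sessions vs code-heavy sessions

low = days * hours_per_day * tok_low      # conservative floor
high = days * hours_per_day * tok_high    # code-heavy ceiling
print(f"{low:,} - {high:,} tokens")       # 5,400,000 - 19,440,000 tokens
```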
What does 19.5 million tokens mean in context?
- It is equivalent to well over a hundred typical technical books: at roughly 0.75 English words per token, about 14--15 million words.
- It exceeds the context window of any existing LLM (as of early 2026, the maximum commercial window is 200K--1M tokens).
- It contains not just "information" --- it contains reasoning paths from the decision process, excluded alternatives, reasons for exclusion, and alternative approaches that were not adopted but worth remembering.
What happens to these 19.5 million tokens after the session ends? They evaporate. Completely and irreversibly.
You can find session titles in ChatGPT's history, but trying to locate that specific discussion about Auth0's pricing model among hundreds of sessions titled "Debug auth issue"? That is effectively equivalent to loss. Claude's project feature retains some context, but it is designed as working memory for the current project, not a long-term knowledge base.
The deeper problem is: these tokens are not uniformly distributed. Truly valuable decisions tend to concentrate in a handful of deep conversations --- perhaps only 5% of total token volume, but containing 80% of the critical judgments. That two-hour discussion about database selection, that forty-minute analysis of three authentication solutions, that hour spent figuring out why microservice A should not directly call microservice C --- these are irreplaceable knowledge assets.
"Irreplaceable" needs explanation. You can certainly redo a database selection analysis. But what you cannot rebuild is the snapshot of constraints at that time --- team size, budget, current tech stack, known pitfalls, resource competition from other ongoing projects. Decisions are not made in a vacuum; they are made under a specific combination of constraints at a specific moment. Losing the decision record is, in essence, losing the constraint snapshot of that moment.
A Day in the Life of Developer Chen Si
To make this problem more concrete, let us follow a fictional but typical developer --- Chen Si --- through her day of using AI.
08:30 - Architecture Discussion. Chen Si is evaluating whether to decompose her team's monolith into microservices. She has a deep conversation with Claude, discussing decomposition boundaries, inter-service communication patterns (gRPC vs REST vs message queues), and data consistency strategies. During the conversation, Claude points out a risk she had not considered: if the order service and inventory service are split, cross-service transactions will require the Saga pattern, and no one on the team currently has hands-on Saga experience. Chen Si decides to split the payments module first, since it has the simplest dependency graph.
The value of this decision lies not only in the conclusion "split payments first" but also in the reasoning for exclusion: "why not split orders first."
10:15 - Debugging Session. Intermittent timeouts appear in production. Chen Si feeds the error logs, span traces, and recent deployment diff to the AI. After three rounds of analysis, the problem is traced to a database connection pool configuration: max_idle_time is set too short, causing connections to be recycled during low-traffic periods, then requiring reconnection when traffic recovers.
This debugging process itself is organizational knowledge --- what the investigation path looks like when similar symptoms appear next time, and which hypotheses were validated and eliminated.
14:00 - Technology Selection. The team needs a frontend state management solution. Chen Si is torn between Zustand, Jotai, and Redux Toolkit. She discusses each option with the AI across dimensions of team size (5 people), application complexity (moderate), TypeScript support, and learning curve. The final choice is Zustand, for its minimal API and best compatibility with React 18's concurrent features.
This selection conversation consumed roughly 8,000 tokens. Three months later, when a new colleague asks "why not Redux," Chen Si can no longer recall the detailed comparative analysis.
16:30 - Code Review Assistance. Chen Si uses the AI to review a colleague's PR and spots a potential N+1 query problem. The AI not only identifies the issue but suggests two fix approaches, explaining why the DataLoader pattern is more appropriate than a simple JOIN in this specific scenario (because the query conditions are dynamic).
Chen Si writes in the PR comment: "N+1 issue, suggest using DataLoader." But the reasoning behind "why DataLoader is better than JOIN" remains in the AI conversation.
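The reasoning Chen Si left behind can be made concrete with a small sketch. The code below is illustrative, not from the PR in question: it contrasts the N+1 pattern (one query per row) with a DataLoader-style batch load (collect keys first, then one query), the fix the AI recommended.

```python
# Toy sketch: N+1 queries vs. a DataLoader-style batched load.
# All names and data here are invented for illustration.

ORDERS = [{"id": i, "user_id": i % 3} for i in range(6)]
USERS = {0: "ana", 1: "bo", 2: "cy"}
QUERY_LOG = []

def fetch_user(user_id):
    # One round trip per call -- the N+1 pattern.
    QUERY_LOG.append(f"SELECT * FROM users WHERE id={user_id}")
    return USERS[user_id]

def fetch_users_batch(user_ids):
    # DataLoader idea: dedupe keys, issue a single IN query.
    ids = sorted(set(user_ids))
    QUERY_LOG.append(f"SELECT * FROM users WHERE id IN {tuple(ids)}")
    return {uid: USERS[uid] for uid in ids}

# N+1: one query per order row.
QUERY_LOG.clear()
naive = [fetch_user(o["user_id"]) for o in ORDERS]
n_plus_one_queries = len(QUERY_LOG)   # 6 queries for 6 rows

# Batched: one query regardless of row count.
QUERY_LOG.clear()
users = fetch_users_batch(o["user_id"] for o in ORDERS)
batched = [users[o["user_id"]] for o in ORDERS]
batched_queries = len(QUERY_LOG)      # 1 query

assert naive == batched
```

Because the order filter here could be any dynamic condition, batching by collected keys stays correct where a hand-written JOIN would have to be rewritten per condition, which is the gist of the "dynamic query conditions" argument.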
End of Day. Chen Si has produced roughly 30,000--40,000 tokens of decision records. She left a few brief conclusions in Slack and updated a few ticket statuses in Jira. But the real decision logic --- the why, the why-not, the conditions under which this decision would fail --- all of it is trapped in the day's AI sessions.
Tomorrow, these sessions will be buried under new conversations or sink into the abyss of chat history. Six months from now, they will become completely irretrievable ghost data.
Three-Layer Deep Analysis
Surface Layer: Phenomena
The most visible phenomenon is "can't remember." Developers frequently rediscuss the same issues in AI conversations because they cannot recall previous conclusions or reasoning processes. The project's knowledge graph becomes riddled with holes --- conclusions exist but reasons do not; decisions exist but constraints do not; solutions exist but excluded alternatives do not.
This repetition does not just waste time; more dangerously, the second discussion may reach a different conclusion under different constraints, and the developer may not even realize a contradiction exists --- because the record of the first discussion has disappeared.
Middle Layer: Causes
This problem has three structural causes:
First, the ephemerality of AI conversations. Unlike Confluence pages or git commit messages, AI conversations are not designed to be persistent media. They are working memory, not long-term memory. Asking AI conversations to serve as a knowledge base is like asking RAM to serve as a hard drive --- the architecture is fundamentally wrong.
Second, the implicit nature of the decision process. In traditional workflows, the decision process has at least one explicit recording step --- writing an RFC, filling an ADR template, updating design documents. AI conversations eliminate this step, not because it is unimportant, but because the conversation itself is the decision process --- an additional recording step feels redundant. The problem is that a conversation can be a decision process, but it is not a decision record. Separating process from record is a basic requirement of knowledge management, and AI conversations conflate the two.
Third, the impossibility of retrieval. Even if platforms retain session history (like ChatGPT), the cost of locating a specific decision among hundreds of sessions is high enough to cause abandonment. This is not a search algorithm problem --- it is a metadata problem. Sessions are not tagged with topics, types, associated projects, or associated people. They are just a chronologically ordered conversation stream. Searching for specific semantics in a chronologically ordered stream is a needle in a haystack.
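The "metadata problem" can be made concrete with a small sketch. The fields below (`topic`, `record_type`, `project`, `participants`) are hypothetical; they are what a session stream would need to carry for "find that pricing discussion" to become a filter instead of a haystack search.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the tags a chronological session stream lacks.
@dataclass
class SessionRecord:
    session_id: str
    started_at: str               # ISO timestamp: the only axis platforms give you
    topic: str                    # e.g. "auth pricing" -- absent in real history
    record_type: str              # "decision" | "debugging" | "selection" | ...
    project: str
    participants: list = field(default_factory=list)

sessions = [
    SessionRecord("s1", "2026-01-10T08:30", "auth pricing", "decision", "billing", ["chen"]),
    SessionRecord("s2", "2026-01-10T10:15", "timeout bug", "debugging", "billing", ["chen"]),
]

# With metadata, "all billing decisions" is a filter, not a semantic search.
decisions = [s.session_id for s in sessions
             if s.project == "billing" and s.record_type == "decision"]
assert decisions == ["s1"]
```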
Deep Layer: Impact
On the surface, this is merely "inconvenient." But on deeper analysis, it is changing the fundamental structure of team knowledge.
Organizational Amnesia. When the reasoning chain for a critical decision exists only in an individual's AI conversations, that knowledge becomes a single point of failure. A team member leaving, transferring, or even just going on vacation can cause permanent loss of critical context. Traditionally, we hedge this risk through documentation, code comments, and commit messages. But when the decision process itself migrates to AI conversations, these traditional hedging mechanisms fail --- because they only capture conclusions, not reasoning.
Decision Drift. Without the anchoring of historical reasoning chains, a team's technical decisions undergo unconscious drift. In January, Postgres was chosen because of JSONB support and PostGIS extensions; but by June, when a new module needs a database, if no one remembers January's complete reasoning, the choice might be MySQL based on different (or even contradictory) reasons. This is not a hypothetical scenario --- any engineer who has worked for more than two years has seen this drift. The evaporation of AI conversations merely shortens the drift cycle from "years" to "months."
Knowledge Debt. We are familiar with the concept of "technical debt" --- sacrificing code quality for short-term speed. The evaporation of AI conversations is creating a new form of debt: knowledge debt. Every unrecorded decision is a unit of knowledge debt. Its interest is the time and cognitive load consumed when that decision must be re-derived in the future. Moreover, unlike technical debt, knowledge debt is often invisible --- you do not know what you have lost until you urgently need it.
A New Fundamental Question
Let us distill this chapter's argument into a fundamental question:
How can the knowledge generated in AI conversations be transformed into persistent, retrievable organizational assets --- without destroying the efficiency of AI conversations?
Note the constraints in this question:
- "Without destroying efficiency" --- Any solution requiring additional manual effort from developers will fail. Requiring developers to manually summarize after every conversation is tantamount to requiring them to return to the era of writing Confluence.
- "Persistent" --- Session history does not count. It must be independent of any specific AI platform's lifecycle.
- "Retrievable" --- Storage is not the problem; retrieval is. The raw storage cost of 19.5 million tokens is negligible, but the cost of locating specific knowledge within 19.5 million tokens is astronomical.
- "Organizational assets" --- Not personal notes, but team-level shareable, transferable knowledge.
This is a real, urgent, and insufficiently addressed problem.
In the chapters that follow, we will see how existing attempts --- from Mem0 to Zep to Letta --- answer this question, and why they make a fundamental error in one key assumption. But before entering those analyses, let us first ensure a full understanding of the problem's scale: every day, millions of developers worldwide are having deep technical conversations with AI, each conversation producing irreplaceable organizational knowledge, and this knowledge is irreversibly destroyed when the session ends.
This is not a tolerable status quo. This is an engineering problem waiting to be solved.
Chapter 2: The Summary Trap
Positioning: This chapter analyzes the shared assumption of existing AI memory systems --- letting the LLM decide what is worth remembering --- and argues why this assumption is fundamentally wrong.
A Seemingly Reasonable Intuition
The previous chapter described the problem: 19.5 million tokens of decision records evaporate after sessions end. Facing this problem, a natural intuition is: let the AI extract important content and remember it.
This intuition gave rise to a wave of AI memory systems. Their core logic is nearly identical:
- Monitor the user's conversation with the AI.
- Use an LLM to extract "key information" from the conversation.
- Store the extracted results in a vector database.
- In future conversations, retrieve relevant memories and inject them into the context.
This workflow appears airtight. But it has a fatal implicit assumption: the LLM can correctly determine what constitutes "key information."
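The four-step loop can be sketched in a few lines of Python. `llm_extract` here is a deliberately crude stand-in for a real LLM call (it keeps only the first sentence), but that crudeness makes the failure mode visible: whatever the extractor drops is gone before retrieval ever runs.

```python
# Sketch of the shared loop of extraction-based memory systems.
# `llm_extract` stands in for an LLM call; the store stands in for a vector DB.

memory_store = []

def llm_extract(conversation: str) -> str:
    # Step 2: something decides what is "key". Here: keep only the first
    # sentence -- crude, but the point is that the rest is silently dropped.
    return conversation.split(".")[0]

def remember(conversation: str) -> None:
    # Steps 1-3: monitor, extract, store. The original is NOT kept.
    memory_store.append(llm_extract(conversation))

def recall(query: str, k: int = 5) -> list:
    # Step 4: fetch "relevant" memories to inject into the next context.
    words = query.lower().split()
    return [m for m in memory_store if any(w in m.lower() for w in words)][:k]

remember("We chose Postgres. MySQL's JSON support felt immature. "
         "MongoDB's transaction model did not fit our use case.")

# The exclusion reasons never made it into memory:
assert recall("why not MongoDB") == []
assert recall("postgres") == ["We chose Postgres"]
```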
Before we dig into this assumption, let us first fairly examine several representative systems and understand their design intent and engineering value.
Three Systems, One Assumption
Mem0: Memory as Extraction
Mem0 (formerly EmbedChain) is one of the most influential projects in this space. Its design philosophy is straightforward: extract factual memories from conversations, store them in a vector store, and retrieve when needed.
Its typical workflow looks like this. A user says in a conversation: "Our team evaluated three database options last week and ultimately chose Postgres, mainly for JSONB support and PostGIS extensions. MySQL's JSON support isn't mature enough, and MongoDB's transaction model doesn't fit our use case." Mem0's extraction module compresses this into a memory entry like:
user prefers Postgres over MySQL and MongoDB
Mem0's engineering execution is solid. It has a mature API, multiple vector backend support, and enterprise-grade features. Its problem is not engineering quality but the act of extraction itself.
Zep: Graph-Enhanced Memory
Zep adds a knowledge graph (Graphiti) on top of vector retrieval. It not only stores "user prefers Postgres" but also attempts to build entity relationships: User -> prefers -> Postgres, Team -> evaluated -> database options.
This is a meaningful improvement. Graph structure makes cross-topic association possible --- you can ask "all of this user's decisions about databases," not just exact-match a specific memory. Zep's Graphiti system uses Neo4j as the graph store and supports temporal validity of facts, meaning it can distinguish "Kai was working on the Orion project in June 2025" from "Kai is no longer working on Orion as of March 2026."
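The idea of temporally valid facts can be illustrated with a toy model. This is not Zep/Graphiti's actual schema, just the minimal structure needed to answer both of the statements above.

```python
from datetime import date

# Toy model of temporally valid facts (NOT Zep/Graphiti's real schema):
# each fact carries a validity interval; end=None means "still valid".
facts = [
    # (subject, predicate, object, valid_from, valid_to)
    ("Kai", "works_on", "Orion", date(2025, 6, 1), date(2026, 2, 28)),
    ("Kai", "works_on", "Vega",  date(2026, 3, 1), None),
]

def facts_at(day: date):
    """Return the facts that were valid on a given day."""
    return [(s, p, o) for s, p, o, start, end in facts
            if start <= day and (end is None or day <= end)]

# "Kai was working on Orion in June 2025" vs. "no longer on Orion by March 2026":
assert facts_at(date(2025, 7, 1)) == [("Kai", "works_on", "Orion")]
assert facts_at(date(2026, 4, 1)) == [("Kai", "works_on", "Vega")]
```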
However, Zep's knowledge graph is still built on top of LLM extraction. The graph's nodes and edges are extracted from conversations by an LLM. If critical context is lost at the extraction stage, no amount of elegant graph structure can compensate for that loss.
Letta (formerly MemGPT): Self-Managed Memory
Letta takes the most radical approach --- it gives the AI the ability to self-manage its memory. The AI can proactively decide what to write to core memory, what to archive to archival memory, and what to forget. This design draws inspiration from operating system virtual memory management: when the context window fills up, swap out non-urgent information to external storage.
Letta's innovation is that it turns memory management itself into an AI capability rather than an external process. The AI is no longer passively having memories extracted; it actively manages its own cognitive resources.
But this introduces a new risk dimension: is the AI's self-judgment reliable? When the AI decides a piece of information is "not important enough" to keep in core memory, what is the accuracy rate of that judgment? This question is nearly impossible to answer --- because you cannot audit information you do not know you have lost.
The Fundamental Problem with Extraction
Let us now dissect the assumption "let the LLM extract key information." Its problems can be understood at three levels.
Level One: Compression Is Lossy
This sounds obvious --- compression discards information (lossless compression aside). But the key point is: for LLM summarization, what gets discarded is unpredictable.
When you compress a file with gzip, you know that decompression yields exactly the same content. When you compress an image with JPEG, you know that high-frequency detail is lost while low-frequency structure is preserved. But when you ask an LLM to "extract key information" from a conversation, you do not know what it will keep and what it will discard --- because this depends on the model's training distribution, the prompt wording, and the contingent combination of context.
Returning to the earlier example, the original conversation contains:
- Conclusion: Choose Postgres.
- Positive reasons: JSONB support, PostGIS extensions.
- Exclusion reasons: MySQL's JSON support is immature, MongoDB's transaction model does not fit.
- Constraints: Team size, existing tech stack, budget.
- Alternative discussion: Perhaps CockroachDB was mentioned but quickly eliminated (no one on the team has experience).
- Temporal context: This decision was made last week, at a specific project phase.
The LLM extraction might retain "user prefers Postgres." It might retain "because of JSONB." But it will almost certainly discard the following:
- Why MongoDB was excluded (transaction model unsuitable --- an implicit indication of what use cases it is suitable for)
- CockroachDB was mentioned but excluded (insufficient team experience --- an implicit indication of the team's capability boundaries)
- At what project phase this decision was made (temporal constraints)
- Who on the team participated in this discussion (accountability)
This discarded information is not "unimportant" --- it is the skeleton of the decision. The conclusion is the muscle; the reasoning chain is the skeleton. Muscle without a skeleton is a mass of flesh that cannot stand.
Level Two: Extraction Errors Are Silent
This is more serious than information loss.
Information loss means "I don't know" --- at least this is an honest state. You know that you don't know, so you will reinvestigate. But extraction error means "I know the wrong thing" --- you think you know, but your knowledge is wrong.
Consider the following scenario. Original conversation:
"We ultimately chose Postgres, but Maya actually preferred MongoDB because she had used it at the Acme project and had a good experience. After a team vote, the majority supported Postgres."
The LLM extraction might produce:
Maya prefers MongoDB based on positive experience at Acme project
Or:
Team chose Postgres; Maya had concerns
Or even:
Maya recommended Postgres based on Acme project experience
The last one is completely wrong --- it reverses Maya's position. But in future conversations, when the AI loads this memory and says "based on previous records, Maya recommended Postgres," the user may not notice the error because it sounds plausible. The conclusion itself is correct (the team chose Postgres); only the attribution is wrong.
The cost of correcting such errors is extremely high. First, you need to realize the memory is wrong --- but why would you question a seemingly plausible memory? Second, even if you discover the error, you need to locate and correct it in the memory store. For most AI memory systems, this means manually editing or deleting memory entries --- an operation that virtually no user performs.
Level Three: The Irrecoverability of the Original Record
This is the most fundamental problem.
After an LLM extracts memories from a conversation, the original conversation is typically not preserved in full. Even if a platform retains session history (like ChatGPT), it is not linked to the memory system --- you cannot trace from an extracted memory back to the original conversation that produced it.
This means extraction is a one-way door. Once through, there is no going back. You cannot audit extraction results ("what was this memory's original source?"), cannot correct errors ("the extraction was wrong, let me see the original and re-extract"), cannot supplement ("what else was discussed at that time?").
In data engineering terms: these systems discard raw data and retain only derived data. Anyone with data engineering experience knows this is an anti-pattern --- because the derivation logic may be wrong, and without raw data, you cannot re-derive.
False Memory vs. No Memory
This leads to a core question: which is more dangerous --- an AI with false memories or an AI with no memory?
The answer: false memory is more dangerous. Significantly more dangerous.
An AI with no memory is a blank slate. It starts from zero every time, requiring you to re-provide context. This is annoying but at least error-free --- it knows nothing about you, so it will not give advice based on false premises. You need to spend a few extra minutes explaining background each time, but the responses you receive are based on the (correct) information you provide in that session.
An AI with false memories is a confident source of misinformation. It "remembers" that you prefer a certain technology and adjusts all subsequent recommendations based on this memory --- even if the memory is wrong. Worse, its confidence discourages you from questioning its premises. When the AI says "based on your previous preference," you typically assume it is correct.
Consider specific scenarios:
Scenario A: Memoryless AI. You ask the AI to help you choose a database. The AI says "please tell me your requirements." You spend two minutes describing constraints. The AI gives a reasonable recommendation. Total cost: two extra minutes of context-providing.
Scenario B: False-memory AI. You ask the AI to help you choose a database. The AI says "based on your previous preference, I recommend MongoDB." But what you actually discussed before was Postgres; the AI's memory extraction got confused. You might accept the recommendation outright (because the AI seems confident), or you might spend time correcting (but you need to first realize it is wrong). Worst case: you make an inconsistent technology selection based on a false "historical preference" and do not discover the problem until months later.
This analysis is not denying the value of AI memory. Memory is extremely valuable --- it eliminates the cost of repeated explanations and makes AI a true long-term collaboration partner. But the first principle of a memory system must be "better absent than wrong." False memory is worse than no memory.
Core Mechanism: Why Extraction Inevitably Fails
Let us analyze the mechanism of extraction failure more precisely.
LLM summarization extraction is fundamentally an information compression task. Its input is a conversation (typically thousands of tokens), and its output is a set of "memories" (typically tens of tokens). The compression ratio is usually between 50:1 and 100:1.
At this compression ratio, what is retained depends on the LLM's "saliency detection" --- which information the model considers most important. An LLM's saliency detection is learned from training data; it reflects statistical regularities in the training distribution, not the importance ranking of your specific project.
For example: in an LLM's training data, "user prefers Postgres" is a high-frequency pattern --- it appears in a large number of technical discussions. Therefore, LLMs tend to extract this type of "preference statement." But "when we evaluated Postgres in Q3 2025, only Kai on the team had production-level Postgres experience" is a fact with no corresponding high-frequency pattern in the training data, so the LLM tends to discard it.
But for your team, the latter may be more important than the former --- because it reveals an execution risk: if Kai leaves, Postgres operations become a single point of failure.
The mismatch between the LLM's saliency detection and your saliency needs is the fundamental reason extraction inevitably fails. You cannot fix this through better prompts --- because the problem is not the prompt but the saliency judgment itself, which is domain-dependent, while the LLM's judgment is domain-independent.
This does not yet account for another factor: the time-varying nature of importance. Information that seems unimportant at extraction time may become critical three months later. "We considered CockroachDB but didn't use it" seems like trivia at the time --- but three months later, when your Postgres cluster hits horizontal scaling bottlenecks, you suddenly want to know: why was CockroachDB excluded? Was it for technical reasons or team experience reasons? If it was an experience issue, has anyone on the team since learned CockroachDB?
Extraction makes judgments in the "present," but memory is used in the "future." You do not know what your future self will need, and therefore you cannot make correct trade-offs in the present.
Benchmark Evidence
The analysis above is not merely theoretical reasoning. Benchmark data provides empirical support.
LongMemEval is a widely adopted AI memory evaluation benchmark that tests a system's ability to retrieve specific information from long-term conversation history. The core metric is R@5 (Recall at 5) --- the proportion of correct answers appearing in the top 5 returned results.
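The metric itself is simple to compute. A minimal sketch, with a synthetic result set constructed to reproduce a 96.6% score (483 hits out of 500 queries):

```python
# Recall@k as described in the text: the fraction of queries whose gold
# answer appears among the top-k retrieved items.

def recall_at_k(results: list, answers: list, k: int = 5) -> float:
    hits = sum(1 for top, gold in zip(results, answers) if gold in top[:k])
    return hits / len(answers)

# Synthetic data: 500 queries; 483 have the gold answer in the top 5.
retrieved = [["gold", "x", "x", "x", "x"]] * 483 + [["x"] * 5] * 17
gold = ["gold"] * 500
score = recall_at_k(retrieved, gold)
assert round(score, 3) == 0.966
```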
Here are the major published results as of now:
| System | LongMemEval R@5 | API Requirements | Cost |
|---|---|---|---|
| MemPalace (hybrid) | 100% | Optional | Free |
| Supermemory ASMR | ~99% (experimental) | Required | -- |
| MemPalace (raw) | 96.6% | None | Free |
| Mastra | 94.87% | Required (GPT) | API cost |
| Mem0 | ~85% | Required | $19-249/month |
| Zep | ~85% | Required | $25/month+ |
This data warrants careful analysis.
Mem0 and Zep's ~85% is not a low score --- it demonstrates that these systems can find the correct memory in most cases. But 85% means that out of every 20 retrievals, 3 fail. For a daily-use memory system, a 3/20 failure rate means the user may encounter 1--2 instances of "memory not found" or "memory is wrong" per day. This is sufficient to severely erode user trust in the system.
More notable is MemPalace (raw) at 96.6% --- this score was achieved without using any API, without calling any LLM. It uses only local ChromaDB vector retrieval, with no LLM reranking, no summarization extraction, no "intelligent" processing whatsoever.
This comparison reveals a deep insight: a significant portion of Mem0 and Zep's 15% failure rate may be caused by the extraction step itself. Not retrieval failure --- information was lost at the extraction stage, meaning that even if retrieval worked perfectly, the correct answer could not be found.
MemPalace achieves 96.6% without using an LLM precisely because it has no extraction step --- original conversations are preserved in full, and search operates directly on the original content. No extraction means no extraction errors. No compression means no compression loss.
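To see mechanically why "no extraction step" matters, here is a toy stand-in for verbatim storage plus similarity search. It uses bag-of-words cosine similarity in place of the learned embeddings a real vector store such as ChromaDB would use, and the sessions are invented, but the architectural point carries over: the exclusion reasons are findable because nothing was summarized away.

```python
import math
from collections import Counter

# Verbatim store: full session text, no extraction, no compression.
sessions = [
    "We chose Postgres for JSONB and PostGIS; MySQL's JSON support is immature.",
    "CockroachDB was eliminated because no one on the team has experience.",
    "max_idle_time too short caused connection pool churn and timeouts.",
]

def vec(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector.
    return Counter(text.lower().replace(";", " ").replace(".", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, k: int = 5) -> list:
    # Retrieval runs directly over the originals -- nothing was lost upstream.
    q = vec(query)
    return sorted(sessions, key=lambda s: cosine(q, vec(s)), reverse=True)[:k]

top = search("why was CockroachDB excluded")[0]
assert "CockroachDB" in top  # the full exclusion reason survives verbatim
```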
The gap between 96.6% and 85% --- nearly 12 percentage points --- is enormous in the context of memory systems. It is not merely a difference in accuracy but a qualitative shift in user experience: a 96.6% accuracy system "occasionally makes mistakes," while an 85% accuracy system "frequently makes mistakes." User trust in memory systems has a nonlinear threshold --- the trust increment from 85% to 96% is far greater than from 70% to 85%.
Three-Layer Depth: Phenomena, Mechanisms, Consequences
Phenomena Layer
Users who use LLM memory systems for a period encounter three types of problems:
Omission --- "I definitely discussed this topic, but the system says there is no relevant memory." This is the result of the extraction stage judging certain content as "not important enough" and discarding it.
Distortion --- "The system remembers what I said, but the details are wrong." This is the compression loss of the extraction stage, where critical details are simplified or distorted during summarization.
Hallucination Inheritance --- "The system fabricated something I never said." This is a hallucination produced by the LLM during extraction that gets stored as memory in the system. A normal LLM hallucination ends when that conversation ends; but a hallucination stored in a memory system will continuously affect all subsequent conversations --- it becomes a "fact."
Each of these problems is tolerable in a single occurrence. But they are cumulative. Errors in a memory system do not self-correct --- an incorrect memory will remain in the system, continuously affecting subsequent interactions, until the user proactively discovers and deletes it. And the probability of users proactively auditing their AI memory store is near zero.
Mechanism Layer
The mechanism behind these phenomena can be distilled into a single model: an unauditable, one-way transformation.
```
Original conversation --[LLM extraction]--> Memory entry --[storage]--> Vector database
          |                                      |
          | (raw data discarded)                 | (cannot trace back to original source)
          v                                      v
    Irrecoverable                          Unauditable
```
This flow has two fatal nodes:
Node 1: Extraction is irreversible. Once the original conversation is "extracted" into memory entries, the original conversation itself is no longer part of the system. This differs from a database materialized view, which can be regenerated from the base table. In AI memory systems, once the "view" (extracted memory) is generated, the "base table" (conversation) is discarded.
Node 2: Extraction is unauditable. Users cannot know from which conversation or which passage a memory was extracted. Therefore, they cannot verify whether the extraction is correct. In a system without audit capability, errors accumulate silently.
The combination of these two nodes creates a vicious cycle: errors are introduced (because extraction is imperfect), errors go undetected (because they are unauditable), errors persistently influence (because memories are persistent), and more errors are introduced (because subsequent conversations are based on erroneous memories).
Consequences Layer
In the long term, memory systems based on LLM extraction evolve into untrustworthy systems. Not because their engineering quality is poor, but because their core mechanism --- letting the LLM decide what is important --- inevitably produces unacceptable error rates over long time horizons.
This consequence has structural implications for the entire AI memory space:
Trust Deficit. After encountering a few false memories, users begin to distrust the entire system. Once trust collapses, even if the system is correct 85% of the time, users will habitually ignore its output. This degrades the memory system from "useful tool" to "noise source requiring verification" --- and the cost of verifying a memory may exceed the cost of not using a memory system at all.
Cold Start Dilemma. New users have the worst experience when the system has not yet accumulated enough memories (because the system remembers nothing), but also have a poor experience after the system has accumulated many memories (because false memories keep growing). This creates a narrow window --- when the system has some memories but not yet too many errors --- where the experience is best. Over time, users inevitably slide out of this window.
Industry Misdirection. The greater risk is: if the entire industry continues down the "LLM extraction" path, then "AI memory is unreliable" could become a widely accepted conclusion --- not because the concept of AI memory is flawed, but because the current implementation path has a fundamental defect. Just as early VR headsets led many to conclude "VR doesn't work," a wrong implementation can kill a right idea.
The Right Question
This chapter's analysis can be distilled into a single judgment:
Letting the LLM decide what is worth remembering is the wrong answer to the "AI memory" question.
So what is the right answer?
The right question is not "what is worth remembering" (this requires predicting future needs, which is impossible), but "how to make retrieval efficient while preserving everything."
In other words: storage is not the bottleneck; retrieval is.
If you can store all original conversations (at nearly zero cost) and can quickly, accurately find relevant content when needed (this is the real engineering challenge), then you do not need anyone --- neither an LLM nor a human --- to make judgments about "what is important and what is not."
This shift --- from "intelligent extraction" to "complete storage + structured retrieval" --- is the subject of the next chapter. There, we will use concrete numbers to demonstrate: the cost of complete storage is surprisingly low, and the room for improvement in retrieval efficiency is large enough to warrant rethinking the entire architecture.
Chapter 3: The Economics of Verbatim Storage
Positioning: This chapter uses hard data to prove that "store everything" is entirely feasible economically, shifting the focal point from "whether to store" to "how to organize," setting the stage for the solution introduced in subsequent chapters.
An Inverted Equation
In the previous two chapters, we established two arguments:
- Millions of tokens' worth of decision records evaporate daily in AI conversations (Chapter 1).
- Letting the LLM decide what is worth remembering is fundamentally wrong (Chapter 2).
The logical intersection of these two arguments points to a seemingly radical conclusion: store everything. No extraction, no compression, no filtering --- preserve every token of every conversation verbatim.
Most people's first reaction to this approach is: "that's too expensive" or "that's not realistic."
This reaction is wrong. It comes from an inverted equation: people overestimate the cost of storage and underestimate the difficulty of retrieval.
Let the numbers speak.
Storage Cost: Approaching Zero
First, let us establish a baseline. In Chapter 1, we estimated the data volume a moderate-intensity AI user generates over 6 months:
180 days x 3 hours/day x ~36,000 tokens/hour ≈ 19.5 million tokens
What does 19.5 million tokens convert to in raw text?
One token corresponds to roughly 4 characters of English text, or about 1.5 characters in mixed Chinese-English scenarios. Taking roughly 3 characters per token as a working midpoint, 19.5 million tokens comes to about 60 million characters, or 60MB of plain text. Adding metadata (timestamps, session IDs, project tags) brings it to about 100MB.
100MB. Six months of all AI conversation records. 100MB.
What does this number mean in 2026?
- Local storage: A 1TB SSD costs roughly $50. 100MB occupies 0.01%. A single photo on your phone might be larger.
- Cloud storage: AWS S3 standard storage costs $0.023/GB/month. The annual storage cost for 100MB is $0.028 --- less than 3 cents.
Storage cost, in any reasonable discussion, can be rounded to zero.
But this is only raw text storage. AI memory systems typically also need vector indexes to support semantic retrieval. Vector index storage overhead is roughly 3--5x the raw text (depending on vector dimensions and index structure). Even so, a 500MB vector database runs perfectly fine locally --- ChromaDB at this scale has query latencies in the millisecond range.
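These figures are easy to sanity-check. A back-of-envelope calculation, assuming roughly 3 characters per token as a midpoint for mixed-language text and the S3 price quoted above:

```python
# Sanity-checking the storage numbers.
tokens = 19_500_000
chars_per_token = 3                 # rough midpoint for mixed-language text
raw_mb = tokens * chars_per_token / 1_000_000   # ~58.5 MB of plain text

s3_per_gb_month = 0.023             # AWS S3 standard storage, USD
with_metadata_gb = 0.1              # ~100 MB with timestamps, IDs, tags
annual_s3_cost = with_metadata_gb * s3_per_gb_month * 12  # ~$0.028/year
```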
So the first conclusion is clear: "can't afford to store" is a non-problem. The cost of completely storing all AI conversations is negligible.
The Real Cost: Not in Storage, but in Usage
Storage is cheap, but usage is not. Specifically, when you need to load memories into an LLM's context window, every token has a cost --- because LLM APIs charge by the token.
This is the real economic decision point. Let us compare the usage costs of four approaches.
Approach One: Full Paste
The most naive approach: stuff all historical conversations into the LLM's context at once.
Token consumption: 19,500,000 tokens
This approach is physically impossible. As of early 2026, the maximum commercial LLM context window is approximately 200K--1M tokens. 19.5 million tokens exceeds any existing model's context window. Even if future context windows expand to tens of millions, the loading cost would be astronomical.
Conclusion: Not possible.
Approach Two: LLM Summary
This is the approach adopted by Mem0, Zep, and similar systems. Use an LLM to compress 19.5 million tokens of original conversations into summaries, and load summaries when needed.
Assuming a 30:1 compression ratio (already quite aggressive), the total summary volume is approximately 650,000 tokens.
How many summary tokens need to be loaded per session? This depends on the scenario, but a reasonable estimate is roughly 5,000--10,000 tokens of relevant memory per session.
Per-session loading: ~7,500 tokens (summaries)
Daily sessions: ~5
Annual token consumption: 7,500 x 5 x 365 = 13,687,500 tokens
At Claude Sonnet's input price of $3/million tokens (early 2026 reference price):
Annual cost = 13,687,500 x $3 / 1,000,000 = ~$41
But this is only the loading cost. Generating the summaries themselves also requires LLM calls --- you need the LLM to process 19.5 million tokens of original conversations to produce summaries. At input pricing:
Summary generation cost = 19,500,000 x $3 / 1,000,000 = $58.5 (one-time)
Adding ongoing incremental updates (new daily conversations needing summarization), the actual annual total cost is roughly $200--500, depending on usage intensity and model choice.
For the comparisons below, take a point estimate near the upper end of that range: ~$500/year.
This cost is not high --- entirely acceptable for professional users. But the problem is not cost but quality: as argued in the previous chapter, these summaries have an accuracy rate of approximately 85%. You are spending ~$500 per year on a memory system that is 85% accurate.
Approach Three: MemPalace Wake-Up
MemPalace's design employs an entirely different strategy. Instead of loading large quantities of memory summaries at each session, it loads a very small "identity layer" --- who you are, who your team is, what your projects are --- and then performs precise retrieval only when needed.
In the current source code, wake-up loads L0 (identity) and L1 (key story / key facts), typically totaling ~600-900 tokens. The more aggressive ~170-token / ~$0.70 figure often cited in the README corresponds to a later target path: rewriting L1 into AAAK and using that smaller representation for wake-up.
Per-session loading: ~600-900 tokens
Daily sessions: ~5
Annual token consumption: 600-900 x 5 x 365 = 1,095,000-1,642,500 tokens
Annual cost = ~$3.3-$4.9
Still only single-digit dollars. ~$3-$5/year. The README's $0.70 figure belongs to the future AAAK wake-up path; for the current default CLI, the correct order of magnitude is "a few dollars," not "a few hundred dollars."
$5. Not $500, not $50. Single-digit dollars per year.
To be fair, though, 600-900 tokens still contain only your identity and the most important story layer --- not every specific historical decision. When you need to look up "why we chose Postgres," you still need retrieval.
Approach Four: MemPalace Wake-Up + On-Demand Retrieval
In actual use, MemPalace's workflow is: first load 600-900 tokens of wake-up information, then perform semantic retrieval as the conversation requires. Each retrieval returns approximately 2,500 tokens of relevant content (including original conversation fragments).
Assuming an average of 5 retrievals per session:
Per-session loading: 600-900 + (5 x 2,500) = 13,100-13,400 tokens
Daily sessions: ~5
Annual token consumption: 23,907,500-24,455,000 tokens
Annual cost = ~$72-$73
Wait --- this number is higher than the summary approach's loading cost? Let us refine the assumptions.
In practice, 5 retrievals/session is a high estimate. Most sessions require only 0--2 retrievals --- memory queries are needed only when the conversation involves historical decisions. A more realistic estimate is an average of 1 retrieval per session:
Per-session loading: 600-900 + (1 x 2,500) = 3,100-3,400 tokens
Daily sessions: ~5
Annual token consumption: 5,657,500-6,205,000 tokens
Annual cost = ~$17-$19
The lower numbers quoted in the README still correspond to the future AAAK wake-up path. The key point is not the exact figure but the order of magnitude: under the current default implementation, MemPalace's annual usage cost is roughly $3-$20 in common use (0-1 retrieval per session) and $70+ in heavier use, while the summary approach sits in the $200-$500 range.
Cost Comparison Summary Table
| Approach | Per-load tokens | Annual cost | Accuracy |
|---|---|---|---|
| Full paste | 19,500,000 (exceeds context window) | Not possible | N/A |
| LLM summary | ~7,500 | ~$500/year | ~85% |
| MemPalace wake-up | ~600-900 | ~$3-$5/year | N/A (identity layer only) |
| MemPalace + on-demand retrieval | ~3,100--13,400 | ~$17-$73/year | 96.6% |
The last column is key: MemPalace is not only far cheaper --- anywhere from ~7x in heavy use to two orders of magnitude for wake-up alone --- but also about 12 percentage points more accurate. This is not a "cheaper but slightly worse" approach --- it is superior on both dimensions simultaneously.
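All the annual figures in the table follow from one cost model. A short sketch that reproduces them, assuming the text's early-2026 reference price ($3 per million input tokens for Claude Sonnet) and ~5 sessions per day:

```python
PRICE_PER_TOKEN = 3 / 1_000_000          # Claude Sonnet input, $3/M tokens (early 2026)
SESSIONS_PER_YEAR = 5 * 365              # ~5 sessions/day

def annual_cost(tokens_per_session: float) -> float:
    """Annual context-loading cost in USD for a given per-session load."""
    return tokens_per_session * SESSIONS_PER_YEAR * PRICE_PER_TOKEN

summary_loading = annual_cost(7_500)            # ~$41 loading only; summary generation
                                                # and updates push the total toward ~$500
wake_up = (annual_cost(600), annual_cost(900))  # ~$3.3-$4.9
typical = annual_cost(900 + 1 * 2_500)          # ~$18.6: wake-up + ~1 retrieval/session
heavy = annual_cost(900 + 5 * 2_500)            # ~$73.4: wake-up + 5 retrievals/session
```

Plug in different retrieval frequencies and the orders of magnitude in the table fall out directly; the model deliberately ignores output-token and embedding costs, which are second-order here.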
Why Retrieval Is Hard
At this point, you might think: if storage is cheap and usage cost is low, is the problem solved?
No. The calculations above have an implicit assumption: retrieval must be accurate. If retrieval returns content other than what you are looking for, no amount of low cost matters --- you have spent tokens loading irrelevant information.
The difficulty of retrieval comes from three levels.
Semantic Gap
There is a semantic gap between the user's query and the memory's content.
The user asks: "Why did we choose Clerk?" The original conversation's phrasing might be: "OAuth provider evaluation conclusion --- Auth0's enterprise pricing triples after 10K MAU, Clerk's pricing is more linear, and the Next.js SDK works out of the box."
"Clerk" appears on both sides, but the semantic correspondence between "chose" and "evaluation conclusion," and the causal correspondence between "why" and the pricing/SDK comparison, both require semantic understanding to establish.
Simple keyword matching would miss this correspondence. Vector retrieval (semantic similarity) can capture part of it, but in large memory stores, semantically similar but irrelevant results (false positives) increase significantly.
Scale Dilemma
A memory store of 19.5 million tokens, segmented into conversation fragments, yields tens of thousands to hundreds of thousands of document chunks. At this scale, the probability of irrelevant content appearing in the top-5 or top-10 vector retrieval results is high.
This is a classic information retrieval problem: as the corpus grows, maintaining both high precision and high recall simultaneously becomes difficult. Increasing recall (not missing relevant results) decreases precision (admitting more irrelevant results), and vice versa.
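The precision/recall trade-off can be made concrete with a toy example (the document IDs and relevance labels below are invented for illustration):

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    hits = len(set(retrieved) & relevant)
    return hits / len(retrieved), hits / len(relevant)

relevant = {"doc3", "doc7", "doc9"}                   # ground truth for one query
top5 = ["doc3", "doc1", "doc7", "doc2", "doc4"]
top10 = top5 + ["doc9", "doc5", "doc6", "doc8", "doc0"]

p5, r5 = precision_recall(top5, relevant)             # precision 0.40, recall 0.67
p10, r10 = precision_recall(top10, relevant)          # precision 0.30, recall 1.00
# Widening the cutoff can only raise recall, but every extra
# irrelevant result drags precision down -- the classic trade-off.
```

At top-5 one relevant document is missed; at top-10 all three are found, but seven of the ten loaded results are noise you pay tokens for.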
Missing Context
Pure vector retrieval lacks structural context. It knows "this document chunk is semantically similar to the query" but does not know "this document chunk belongs to which project, involves which people, or was produced at what project phase."
Without this structural context, the retrieval system cannot answer queries like:
- "Kai's advice about databases" --- needs to know who Kai is and which conversations involved Kai.
- "Driftwood project decisions last month" --- needs to know which project Driftwood is and requires time filtering.
- "Pitfalls we hit during the auth migration" --- needs to know that the auth migration is a topic spanning multiple conversations.
What Makes Retrieval Feasible
The three retrieval difficulties --- semantic gap, scale dilemma, missing context --- are not unsolvable. Each has known solutions; these solutions simply need to be combined.
Solution One: Shrink the Search Space
The most effective counter to the scale dilemma is not a better search algorithm but a smaller search space.
If you can know approximately which region the answer is in before searching, you can shrink the search scope from tens of thousands of document chunks to a few hundred. Both precision and recall are easy to maintain at high levels on a small corpus.
But "knowing which region the answer is in" requires the memory store to have structure. Not a flat vector database, but an organizational system with hierarchy, classification, and associations.
This observation itself is not novel --- library science has studied it for centuries. The core insight of the Dewey Decimal Classification is: classify first, then search. Classification reduces an O(N) search problem to an O(N/K) search problem, where K is the number of categories.
For AI memory, the classification dimensions can be:
- Who --- Whose memories?
- What project --- Which project does it belong to?
- What type --- Is it a decision, an event, a preference, or a recommendation?
- What topic --- Specifically about auth, databases, deployment, or frontend?
If a query can be routed to the correct classification combination (e.g., "Driftwood project database decisions"), the search space can shrink from tens of thousands to tens --- dramatically improving both precision and recall.
The data quantifies this effect. On a test set of 22,000+ memories, simply narrowing the search scope step by step by "who," "type," and "topic," R@10 improved from 60.9% to 94.8% --- structure alone produced a 34-percentage-point retrieval improvement, requiring no better vector model, no LLM reranking, no additional computational cost. Purely a change in how data is organized (see Chapter 7 for the layer-by-layer benchmark data).
This is a profound result. It demonstrates that the greatest lever for retrieval efficiency is not at the algorithm level (better embeddings, more precise similarity calculations) but at the data organization level --- how information is placed into the right drawers.
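A minimal sketch of the classify-then-search idea --- hard metadata filters first, vector math second. The field names (`who`, `type`, `topic`), the in-memory store, and the toy 2-D vectors are hypothetical illustrations, not MemPalace's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    vector: list[float]
    who: str      # hypothetical classification fields
    type: str
    topic: str

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

def search(store, query_vec, k=5, **filters):
    # Classify first, then search: metadata filters shrink the candidate
    # set from O(N) to O(N/K) before any similarity math runs.
    candidates = [m for m in store
                  if all(getattr(m, f) == v for f, v in filters.items())]
    return sorted(candidates, key=lambda m: cosine(m.vector, query_vec),
                  reverse=True)[:k]

store = [
    Memory("chose Postgres for Driftwood", [1.0, 0.1], "kai", "decision", "databases"),
    Memory("deploy script fixed", [0.9, 0.2], "maya", "event", "deployment"),
    Memory("prefers dark mode", [0.2, 1.0], "soren", "preference", "frontend"),
]
# Only Kai's decision survives the who/type filter; the vector search
# then runs over a tiny, semantically coherent candidate set.
top = search(store, [1.0, 0.0], who="kai", type="decision")
```

The same pattern maps onto real vector databases: ChromaDB and similar stores support metadata filtering alongside similarity search, so the routing step costs essentially nothing extra.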
Solution Two: Layered Loading
Not all memories need to be loaded in every session. Human memory is layered too --- your name, your job, your team are "always online" memories; what you had for lunch last Tuesday is an "on-demand retrieval" memory.
AI memory can adopt the same layered strategy:
| Layer | Content | Size | Loading Timing |
|---|---|---|---|
| L0 | Identity --- who this AI is | ~100 tokens | Always loaded |
| L1 | Key story / key facts --- high-weight and recent memories | ~500-800 tokens | Always loaded |
| L2 | Topic memory --- room-scoped recall | On-demand | On explicit recall / lightweight filtering |
| L3 | Deep retrieval --- full semantic search | On-demand | When explicitly needed |
In the current implementation, L0 + L1 totals roughly 600-900 tokens. The README's more aggressive 170-token figure belongs to a later stage in which L1 is fully AAAK-ified. Even at the current size, though, this is still a cheap persistent context layer --- and, more importantly, it tells the AI who you are, what your team structure is, and which project you are working on, so later retrieval starts from the right frame.
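The layered loading policy can be sketched in a few lines. The layer contents, names, and token heuristic below are illustrative assumptions, not the actual MemPalace file format:

```python
# Illustrative layer contents -- not the actual MemPalace data.
L0_IDENTITY = "I am the AI assistant of Priya, lead of the Driftwood team."
L1_KEY_STORY = "Current sprint: migrate auth to Clerk (Kai's recommendation over Auth0)."

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)        # rough heuristic: ~4 chars per token

def wake_up(budget: int = 900) -> str:
    """Always-loaded context: L0 (identity) + L1 (key story).
    L2/L3 stay on disk until a question explicitly needs them."""
    context = L0_IDENTITY + "\n" + L1_KEY_STORY
    assert estimate_tokens(context) <= budget, "wake-up layer over budget"
    return context
```

The design point is the budget assertion: L0 + L1 is a fixed, small tax paid at every session start, while everything else is pay-per-use.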
Solution Three: Compression, Not Summarization
A critical distinction must be made here: compression and summarization are not the same thing.
Summarization is lossy --- it discards "unimportant" information (but who defines "unimportant"?). Compression is ideally lossless (or near-lossless) --- it aims to preserve the same factual assertions in a more compact representation.
Lossless text compression is generally believed to have limits --- natural language has finite redundancy. But what if the compression target is not for humans to read, but for AI to read?
AI and humans process text differently. Humans need complete grammar, punctuation, and connectives to understand meaning. AI can recover full semantics from highly compressed structured text.
An AI-oriented compression dialect can, in principle, achieve very high compression while preserving factual structure. Consider a short, structured example:
Original (~1000 tokens):
Priya manages the Driftwood team. Team members include: Kai (backend dev, 3 years experience),
Soren (frontend dev), Maya (infrastructure), Leo (junior dev, joined last month).
They are developing a SaaS analytics platform. The current sprint's task is migrating
the authentication system to Clerk. Kai recommended Clerk over Auth0 based on pricing
and developer experience.
Compressed (~120 tokens):
TEAM: PRI(lead) | KAI(backend,3yr) SOR(frontend) MAY(infra) LEO(junior,new)
PROJ: DRIFTWOOD(saas.analytics) | SPRINT: auth.migration→clerk
DECISION: KAI.rec:clerk>auth0(pricing+dx)
In this example, the factual assertions are preserved and the representation is about 8x shorter. But the current open-source dialect.compress() path should not be confused with a universal strict-lossless guarantee: its real behavior includes key-sentence extraction, topic selection, and truncation of entities / emotions / flags. In the current repository, AAAK functions more like a high-compression index layered on top of raw Drawer storage than like the only copy of the memory.
The ideal difference between AAAK-style compression and LLM summarization is: the former tries to preserve factual structure while changing representation, whereas the latter decides what to keep and what to discard. In the current implementation, the raw text remains preserved in Drawers, and the AAAK-like layer is best read as a compact navigational representation rather than a perfect substitute for the original.
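A toy illustration of representation-changing compression. The fact schema and rendering here are invented for this example; the repository's actual `dialect.compress()` is more involved, as noted above:

```python
# Structured facts extracted from the original ~1000-token passage.
# The assertions are kept; only the encoding shrinks.
facts = {
    "TEAM": ["PRI(lead)", "KAI(backend,3yr)", "SOR(frontend)",
             "MAY(infra)", "LEO(junior,new)"],
    "PROJ": ["DRIFTWOOD(saas.analytics)", "SPRINT: auth.migration->clerk"],
    "DECISION": ["KAI.rec:clerk>auth0(pricing+dx)"],
}

def render(facts: dict[str, list[str]]) -> str:
    # Change the representation, not the content: each key becomes
    # one dense line, values joined with a cheap delimiter.
    return "\n".join(f"{key}: {' | '.join(vals)}" for key, vals in facts.items())

compressed = render(facts)
# Roughly an order of magnitude shorter than the prose original
# at ~4 chars/token -- and every fact is mechanically recoverable.
```

Contrast this with summarization: a summarizer might drop Leo entirely as "unimportant," whereas the representation change above has no opinion about importance.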
The Inverted Equation, Corrected
Let us return to the equation from the beginning of this chapter.
Traditional thinking: Storage is expensive, so compression (via summarization) is needed to reduce storage and usage costs.
Reality: Storage is free, usage cost depends on retrieval efficiency, and retrieval efficiency depends on how data is organized.
Correcting this equation means:
- Do not optimize at the storage end. The cost of storing all raw data is near zero. Any "optimization" done at the storage end (such as summarization extraction) merely adds risk (information loss) without reducing cost.
- Optimize at the retrieval end. The real cost savings come from loading fewer but more accurate tokens into the LLM context. Today's 600-900-token wake-up versus 7,500 tokens of summaries already demonstrates this advantage; if the README's AAAK wake-up path is fully connected later, that gap grows even further.
- Invest at the organization end. The 34-percentage-point retrieval improvement comes from how data is organized --- this is the highest-ROI component of the entire approach. Good data organization lets simple retrieval algorithms match the performance of complex algorithms running on unorganized data.
Transition: From "Why" to "How"
Three chapters in, we have completed the full description of the problem space:
- Chapter 1 answered "what is happening" --- technical decisions have migrated into AI conversations, producing large quantities of irreplaceable knowledge assets daily that evaporate after sessions end.
- Chapter 2 answered "why existing approaches don't work" --- the assumption of having LLMs extract key information is fundamentally wrong; the false memories it produces are more dangerous than no memory at all.
- Chapter 3 answered "what is the right direction" --- store everything (at near-zero cost), then make retrieval efficient through data organization (a 34-percentage-point improvement from structure alone).
The reader should now have a clear understanding of the following points:
- The problem is real and urgent.
- The mainstream existing approach (LLM extraction) has a fundamental flaw in its core assumption.
- The right direction is "complete storage + structured retrieval," not "intelligent extraction + flat storage."
- Storage is not the bottleneck; organizational method is the key.
But we have not yet answered the specific "how":
- What data organization structure produces that 34-percentage-point retrieval improvement?
- What exactly does the current ~600-900-token wake-up contain, and what would the README's ~170-token AAAK path change?
- How does the on-demand semantic search work?
- How is high-ratio, near-lossless (AAAK-style) compression achieved?
These questions will be addressed one by one in Part 2 of this book --- the solution space. There, we will see how an ancient memory technique --- the memory palace --- was reinvented as a knowledge architecture for the AI era.
Chapter 4: Method of Loci
Positioning: From the ruins of an ancient Greek banquet hall to the vector spaces of large language models --- why spatial structure is effective for information retrieval, and how this twenty-five-hundred-year-old insight became MemPalace's theoretical cornerstone.
The Banquet Hall Collapses
One day in the fifth century BCE, the Greek poet Simonides of Ceos attended a banquet hosted by the Thessalian nobleman Scopas. Following the custom of the time, Simonides recited a laudatory poem at the feast praising his host's achievements, but he also included praise of the twin gods Castor and Pollux. Scopas was displeased and paid him only half his fee, telling him to seek the other half from those two gods.
Midway through the banquet, a servant brought word that two young men were asking for Simonides at the door. He rose and left the banquet hall, walked outside, and found no one. At the very moment he stood outside, the roof of the banquet hall behind him collapsed. All the guests were buried under the rubble, their bodies crushed beyond recognition, and their families could not identify the dead.
Simonides discovered he could help identify the bodies --- because he remembered where each guest had been seated. He recalled not a list of names but a space: who sat at the north end of the long table, who was near the entrance, who was to Scopas's left. By "walking through" the banquet hall in his mind, he identified the dead one by one.
This story comes from Cicero's De Oratore and also appears in Quintilian's Institutio Oratoria. It has been regarded by posterity as the origin narrative of the "Method of Loci" --- the "memory palace" technique. From this experience, Simonides distilled a principle: human memory for spatial locations is far more reliable than memory for sequential information.
This principle has been repeatedly validated, forgotten, rediscovered, and validated again over the following twenty-five hundred years. The most recent "rediscovery" occurred in late 2025 --- when a systems engineer trained in Classics applied it to an AI memory system and produced a quantifiable 34-percentage-point retrieval precision improvement.
How the Method of Loci Works
The operational steps of the Method of Loci are described with remarkable consistency across classical literature. Cicero, Quintilian, and later medieval rhetoricians gave nearly identical instructions:
Step One: Choose a building you know well. It must be a space you can clearly walk through in your mind --- your home, your school, a street you frequent. The key is that the structure of this space must be automatic for you; you do not need to think about "what is the next room" --- you simply walk there.
Step Two: Select a number of fixed locations (loci) within the building. These locations must be specific, visually distinctive, and in a stable order. The vase by the door, the fireplace in the living room, the windowsill in the study. Each location serves as a "hook" on which to attach information you need to remember.
Step Three: Transform the information to be memorized into vivid images and place them at each location. You do not simply place the concept "Aristotle's four causes" in the study; instead, you place a scene on the study's desk --- perhaps Aristotle himself sitting there sculpting a block of marble (formal cause acting on material cause). The more bizarre, emotionally charged, and interactive with the location the image is, the more durable the memory.
Step Four: When recalling, mentally walk through the building along the route. As you pass each location, the image placed there surfaces automatically. You do not need to search, do not need to recall "what comes next" --- the spatial route itself is the retrieval index.
This method's effectiveness for competitive memory athletes is undisputed. Data from the 2017 World Memory Championship shows that top-ranking competitors nearly all use some variant of the Method of Loci. More importantly, there is laboratory validation from cognitive science.
Cognitive Science Validation
The Method of Loci is not a purely anecdotal tradition. Cognitive psychology and neuroscience research over the past three decades has subjected it to systematic empirical testing.
In 2017, Dresler et al. published a landmark study in Neuron. They compared the brain structures and functions of World Memory Championship competitors with those of ordinary people and found that memory champions' brains showed no significant anatomical differences from ordinary ones --- they did not possess larger hippocampi or denser neural connections. The real difference lay in functional connectivity patterns: when memory champions used the Method of Loci, their brain activation patterns exhibited a high degree of coordination between the spatial navigation network and the memory encoding network.
Even more striking were the training effects. The researchers had ordinary people undergo six weeks of Method of Loci training, 30 minutes per day. After training, these ordinary people's memory performance approached that of competitive athletes, and their brain functional connectivity patterns also shifted toward the athletes' patterns. This means the Method of Loci is not a talent but a learnable cognitive technique --- it enhances memory encoding by activating the brain's spatial navigation system.
Why is spatial memory so special? Evolutionary psychology offers a plausible explanation. For the vast majority of human evolutionary history, remembering "what is where" was a basic survival capability --- the location of food sources, the direction to water, dangerous areas. This spatial memory ability, shaped by millions of years of natural selection pressure, is deeply wired into the brain's hardware architecture. Memorizing a string of abstract sequential information --- phone numbers, shopping lists, speech points --- is a demand that arose only in recent civilization, and the brain has not evolved specialized hardware support for it.
The brilliance of the Method of Loci is that it converts a task the brain is poor at (memorizing sequential information) into a task the brain excels at (memorizing spatial locations). It does not fight against how the brain works; it works with it.
O'Keefe and the Mosers were awarded the 2014 Nobel Prize in Physiology or Medicine for discovering "place cells" and "grid cells" in the brain. These cells constitute the brain's internal GPS --- a precise, automatically running spatial positioning system. The Method of Loci works precisely because it conscripts this pre-existing, highly optimized neural system to assist general-purpose memory.
You do not need to build new memory infrastructure. You just need to place information into infrastructure that already exists.
Not a Myth: Three Levels of Evidence
The Method of Loci is occasionally dismissed as "pop psychology" or a "memory trick," and sometimes even questioned as myth. This skepticism deserves a serious response, because MemPalace's entire design premise rests on the Method of Loci's effectiveness.
The evidence can be examined at three levels:
Literary evidence: Cicero's De Oratore (55 BC) and Quintilian's Institutio Oratoria (95 AD) independently document the same technique. This is not a single source --- two rhetoricians from different eras describe identical operational methods, and medieval rhetorical textbooks continued the tradition unbroken for fifteen hundred years. The Simonides banquet story itself may have been embellished, but the Method of Loci as an operational technique is supported by a continuous teaching tradition spanning over a millennium.
Neuroscience evidence: As discussed above, Dresler's 2017 Neuron study provides fMRI-level evidence --- the Method of Loci activates brain regions that overlap with spatial navigation areas, and the effect is reproducible through training. O'Keefe and the Mosers' discovery of place cells and grid cells (2014 Nobel Prize) provides the anatomical foundation --- the brain genuinely possesses dedicated spatial encoding hardware, and the Method of Loci conscripts precisely this hardware.
Competitive evidence: The World Memory Championships, held annually since 1991, consistently see top-ranked competitors using the Method of Loci or its variants. This is not a talent-selection effect --- Dresler's research demonstrates that ordinary people can approach competitor-level performance after just six weeks of training. The Method of Loci is a teachable, learnable, verifiable cognitive technique.
These three levels of evidence are independent and mutually reinforcing. The effectiveness of the Method of Loci does not depend on the historical accuracy of the Simonides story --- even if that story were entirely fictional, the neuroscience and competitive data would still stand.
Ben's Classical Education
With the cognitive science foundation of the Method of Loci understood, Ben Sigman's academic background is no longer a biographical curiosity but a key clue for understanding MemPalace's design logic.
Ben earned a Classics degree from UCLA. The core curriculum of Classics includes reading original texts in ancient Greek and Latin, ancient rhetorical theory, and ancient philosophy. Rhetoric --- particularly the rhetorical tradition of Cicero and Quintilian --- is the academic homeland of the Method of Loci. Someone trained in Classics does not treat the "memory palace" as a metaphor or a pop psychology concept. For him, it is a cognitive technique with over two thousand years of practical validation and a complete theoretical foundation.
This distinction is crucial. If "memory palace" is merely a metaphor --- meaning "organize information neatly" --- then any hierarchical file system could claim to be a "memory palace." But the Method of Loci is not about "neatness." It is about spatial structure as a retrieval index. Each location (locus) is not a label but a coordinate. You are not searching for information within labels; you are encountering information as you walk through space.
This distinction has very concrete expression in MemPalace's design. The core idea is not "throw all memories into one giant vector pool and search blindly," but "use spatialized semantic structure to shrink the candidate set before heavy retrieval." Wing, Hall, and Room express a topological intuition: you are not searching a flat label list, you are progressively approaching a target inside a partitioned building. In the current open-source implementation, that prior shows up most directly as explicit wing / room filtering and room-based aggregation; the richer Hall narrative is closer to design language than to every step of the default runtime.
Ben did not design MemPalace starting from software engineering's "partitioning" concept. He started from rhetoric's "loci" concept. These two paths may produce similar-looking structures on the surface, but in the details of design decisions, they lead to very different choices. A database partitioning scheme pursues uniform data distribution and optimal query plans; a memory palace pursues semantic coherence and cognitive naturalness --- even if this means some "rooms" are much larger than others.
From Human Memory to AI Memory
Here a question arises that needs a serious answer: the Method of Loci is effective for the human brain, but that does not mean it is effective for AI systems. The human brain has place cells and grid cells, with spatial navigation hardware optimized by evolution. Vector databases have none of these. So on what basis can a technique that depends on the brain's spatial hardware succeed when applied to AI memory systems?
The answer is: the Method of Loci's efficacy comes not only from the spatial hardware itself but from the prior constraints that spatial structure provides.
Consider a memory system with no structure whatsoever. You have 22,000 memories (the actual data scale used in MemPalace's benchmark tests), stored in a vector database. When you search for a query, the system must find the most relevant entries among 22,000 vectors. This search depends entirely on cosine distance between vectors.
The problem is that in high-dimensional vector spaces, the discriminative power of cosine distance degrades as dimensionality increases --- the so-called "curse of dimensionality." When embedding dimensions reach 384 (the default for all-MiniLM-L6-v2), many semantically different texts have very small distance differences between them. The distance difference between the first-ranked and tenth-ranked results might be only 0.02. At this precision level, the distance difference between a "nearly correct" result and a perfectly correct one may be drowned out by noise.
Now consider a structured memory system. The same 22,000 memories are organized into 8 Wings, each containing several Rooms. Whether that structural prior is provided explicitly by the user or approximated by a higher-level router, as long as retrieval begins inside a semantic subspace, the candidate set shrinks sharply. Assuming the target Wing contains 2,750 memories (22,000 / 8), the search space shrinks to 1/8 of the original.
But the key is not that the search space shrank --- a random 8-way partition could do the same. The key is that the structure is semantically coherent. Memories within the same Wing are semantically related to each other, while memories in different Wings are semantically relatively orthogonal. This means that when performing vector retrieval within a Wing, interference items (semantically similar but irrelevant documents) are greatly reduced. You no longer need to distinguish tiny distance differences among 22,000 points --- you only need to discriminate among 2,750 semantically related points, and in this subspace, the distance gap between correct and incorrect results is significantly amplified.
This is the equivalent of the Method of Loci in AI systems: spatial structure does not help "remember" information (information is already stored in the vector database) but helps "find" information. It reduces the difficulty of the retrieval task by introducing an organizational dimension orthogonal to the content.
In the human brain, this orthogonal dimension is physical space (rooms in a building). In MemPalace, this orthogonal dimension is semantic topology (the Wing/Hall/Room hierarchical structure). The underlying mechanisms differ, but the information-theoretic effect is the same: structure serves as a prior, reducing uncertainty in the retrieval process.
Three Levels of Depth: From Metaphor to Mechanism to Engineering
The concept of "memory palace" can be understood at three levels.
Level One: Metaphor. The shallowest understanding treats it as a name --- "our system is called a memory palace because it organizes information to look like a building." This level has no substantive content. Any tree structure could be called a "palace."
Level Two: Cognitive Principle. A deeper understanding recognizes that the Method of Loci reveals a universal principle about memory and retrieval: spatial structure reduces retrieval cost. This principle does not depend on the special hardware of the human brain --- it is an information-theoretic insight. Whenever a retrieval system faces the problem of "finding the right answer among many candidates," introducing an orthogonal organizational dimension reduces the difficulty of that problem.
Level Three: Engineering Constraints. The deepest understanding translates the Method of Loci's principle into concrete engineering constraints: Wing boundaries must be semantic boundaries, not arbitrary partitions. Hall classifications should be cognitive categories rather than mere database indexes. Room names should be human-understandable concept nodes, not hash values. Tunnels should be the natural emergence of the same concept across different domains, not artificially defined links.
The progressive relationship among these three levels explains why other systems have not achieved the same thing. Most AI memory systems stay at Level One --- they may also have "partitions" or "categories," but these are designed for database performance, not retrieval precision. MemPalace's design proceeds from Level Two (cognitive principle -> information-theoretic insight), lands at Level Three (concrete engineering constraints), and then uses Level One's metaphor (Wing/Hall/Room) to name those constraints.
Different order, different result.
From Concept to Code
This chapter has not discussed any source code --- this is intentional. The core insight of the Method of Loci is an implementation-independent principle: spatial (or spatial-like) structure, serving as a retrieval prior, can significantly reduce the difficulty of retrieval tasks. This principle is realized in the human brain through place cells and grid cells, and in MemPalace through Wing/Hall/Room hierarchical metadata, but the principle itself is more fundamental than any single implementation.
The next chapter will enter the implementation layer: what the five tiers --- Wing, Hall, Room, Closet, Drawer --- each are, why they are designed this way, and what trade-offs each tier's design involves. If this chapter answered "why spatial structure works," then the next chapter answers "how MemPalace turns spatial structure into engineering reality."
Here is a key design decision worth previewing: MemPalace's five-tier structure was not designed top-down ("we need a five-tier architecture") but reverse-engineered from retrieval requirements. Each tier corresponds to a retrieval failure mode: without Wings, semantic noise from different domains interferes with one another; without Rooms, concept boundaries within a theme blur. Hall and Closet express the system's intended cognitive organization, even though in the current open-source runtime they do not always appear as hard metadata or as default query steps. Each tier is an answer to a real problem, not a pre-planned architectural layer.
Simonides discovered in the fifth century BCE that human spatial memory can be conscripted to enhance sequential memory. Twenty-five hundred years later, the same principle has been revalidated in an AI system through a means Simonides could not have foreseen --- vector database metadata filtering. This is not because AI and the human brain work the same way, but because the retrieval problems both face are isomorphic at the information-theoretic level.
The memory palace is not a metaphor. It is a methodology.
Chapter 5: Wing / Hall / Room / Closet / Drawer
Positioning: The design motivations, implementation details, and engineering trade-offs of MemPalace's five-tier structure --- understanding from the source code why each tier exists and why it takes this shape.
Five Tiers, No More, No Less
The previous chapter established the core argument: spatial structure as a retrieval prior can significantly improve information retrieval precision. But "spatial structure" is an abstract concept --- it could be two tiers (partition/document), ten tiers (deeply nested classification hierarchies), or entirely flat (a large vector space with metadata tags). MemPalace chose five tiers. This choice is not arbitrary.
The five-tier structure is: Wing -> Hall -> Room -> Closet -> Drawer. Each tier corresponds to a different granularity of semantic partitioning, and each tier addresses a different retrieval failure mode.
Before entering the tier-by-tier analysis, here is the overall architecture:
graph TD
P[Palace] --> W1[Wing: wing_kai]
P --> W2[Wing: wing_driftwood]
P --> W3[Wing: wing_priya]
W1 --> H1[Hall: hall_facts]
W1 --> H2[Hall: hall_events]
W1 --> H3[Hall: hall_discoveries]
W1 --> H4[Hall: hall_preferences]
W1 --> H5[Hall: hall_advice]
H2 --> R1[Room: auth-migration]
H2 --> R2[Room: oauth-debugging]
R1 --> C1[Closet: compressed summary]
C1 --> D1[Drawer: original conversation text]
C1 --> D2[Drawer: original conversation text]
This diagram shows the complete path from Palace to Drawer. A query like "what did Kai do on the auth migration last week" navigates along this path: first entering wing_kai (scoping to person), then entering hall_events (scoping memory type to events), then reaching auth-migration (scoping to the specific concept), and finally finding the Drawers containing original text within the Closet.
Every step does the same thing: narrow the search space while maintaining semantic coherence.
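As a sketch of this narrowing process, here is a toy in-memory version of the same navigation. The records and the narrow() helper are hypothetical stand-ins for ChromaDB documents and metadata filters, not MemPalace code:

```python
# Hypothetical records standing in for ChromaDB documents; the
# wing/hall/room values follow the example path in the text.
memories = [
    {"wing": "wing_kai", "hall": "hall_events", "room": "auth-migration",
     "text": "Kai debugged the OAuth token refresh"},
    {"wing": "wing_kai", "hall": "hall_facts", "room": "auth-migration",
     "text": "Kai prefers Clerk's SDK"},
    {"wing": "wing_driftwood", "hall": "hall_facts", "room": "auth-migration",
     "text": "team decided to migrate auth to Clerk"},
    {"wing": "wing_kai", "hall": "hall_events", "room": "oauth-debugging",
     "text": "traced a 401 to a stale refresh token"},
]

def narrow(records, **filters):
    # Each step keeps only records matching every metadata filter so far.
    return [r for r in records
            if all(r.get(k) == v for k, v in filters.items())]

step1 = narrow(memories, wing="wing_kai")       # scope to person: 3 left
step2 = narrow(step1, hall="hall_events")       # scope to events: 2 left
step3 = narrow(step2, room="auth-migration")    # scope to concept: 1 left
assert [r["text"] for r in step3] == ["Kai debugged the OAuth token refresh"]
```

Each call shrinks the candidate set without ever computing a vector distance; the vector search only has to rank what survives.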
Tier One: Wing --- Semantic Boundaries
What a Wing Is
Wing is the coarsest-grained organizational unit. Each person, project, or subject area has its own Wing. In MemPalace's AAAK specification, predefined Wings include:
wing_user, wing_agent, wing_team, wing_code,
wing_myproject, wing_hardware, wing_ue5, wing_ai_research
(mcp_server.py:112)
But Wings are not limited to these predefined names. The configuration system in config.py allows users to define arbitrary Wings:
DEFAULT_TOPIC_WINGS = [
"emotions", "consciousness", "memory",
"technical", "identity", "family", "creative",
]
(config.py:14-22)
At the search layer, a Wing is a ChromaDB where filter condition. When you specify wing="wing_kai" for a search, searcher.py constructs the following filter:
where = {"wing": wing}
(searcher.py:33)
This means vector retrieval is performed only among documents belonging to wing_kai --- documents from other Wings do not participate in distance calculations at all.
Why It Is Designed This Way
Wing's design motivation is to solve the cross-domain semantic interference problem.
Consider a concrete scenario. You discussed "auth migration to Clerk" in one project (Driftwood) and also discussed auth-related topics in another project (Orion). Without Wing separation, searching "why did we choose Clerk" might return results from both projects --- because in vector space, two discussions about auth are indeed semantically close. But your intent clearly points to Driftwood.
Wing eliminates this interference through hard filtering (not soft weighting). This is a design choice with a cost --- if the user genuinely wants to search all auth-related content across projects, they need to omit the Wing filter. But MemPalace's benchmarks show that in the vast majority of real queries, users are indeed interested only in information within a specific domain.
Trade-offs
Wing uses hard filtering rather than soft weighting, which means:
Advantage: The reduction in search space is deterministic. If the Palace has 8 Wings, specifying a Wing reduces the search space to roughly 1/8. At the scale of 22,000 memories, this means shrinking from 22,000 to approximately 2,750 --- vector retrieval precision improves significantly on a smaller candidate set.
Cost: If Wing assignment is wrong, the target document is completely excluded. This is not a "ranking drops" problem --- it is a "completely invisible" problem. This is why Wing assignment must be a high-confidence decision, typically based on explicit metadata (which project directory a file comes from, which person's name is mentioned in the conversation), rather than fuzzy semantic inference.
Tier Two: Hall --- Cognitive Classification
What a Hall Is
Hall is the second-level partition within a Wing, classified by the cognitive type of the memory. MemPalace defines five fixed Hall types:
hall_facts — established facts and decisions
hall_events — events, meetings, milestones
hall_discoveries — breakthroughs, new findings, insights
hall_preferences — preferences, habits, opinions
hall_advice — suggestions, recommendations, solutions
(mcp_server.py:111)
These five Hall types exist in every Wing. They are "corridors" --- connecting different Rooms within the same Wing while tagging memories by cognitive type.
In the palace graph, Halls appear as edge attributes. palace_graph.py associates each Room with its Hall when constructing the graph:
room_data[room]["halls"].add(hall)
(palace_graph.py:60)
In the graph's edges, a Hall describes the "corridor type" traversed by a tunnel connecting two Wings:
edges.append({
"room": room,
"wing_a": wa,
"wing_b": wb,
"hall": hall,
"count": data["count"],
})
(palace_graph.py:75-84)
Why Five Fixed Types
The choice of five Hall types is not arbitrary --- it reflects an assumption about human cognitive classification: people tend to classify memories using a relatively fixed set of modes.
Everything you remember roughly falls into one of these categories: it is a fact ("we use PostgreSQL"), an event ("yesterday's deployment had issues"), a discovery ("it turns out the connection pool was set too small"), a preference ("I prefer Terraform over Pulumi"), or a piece of advice ("next time this happens, check the logs first").
The key characteristic of this classification system is that it is close to mutually exclusive and collectively exhaustive. A single memory typically belongs to only one type --- "we decided to use Clerk" is a fact, not an event; "Kai recommended Clerk" is advice, not a preference. A few edge cases are ambiguous (a discovery might also be a fact), but the classification of the vast majority of memories is clear.
Why not three types? Because three types (e.g., facts/events/opinions) are too coarse --- "Kai recommended Clerk" and "I prefer Clerk" would both be "opinions" under three categories, but their query intents are entirely different. One is about "who said what" (tracing the source of advice); the other is about "what is my position" (confirming a preference).
Why not ten types? Because the finer the classification, the lower the classification accuracy. Five is a balance point between discriminative power and classification accuracy. Each Hall has sufficiently clear semantic boundaries that automatic classification (whether keyword-based or LLM-based) can achieve high accuracy.
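To make "automatic classification can achieve high accuracy" concrete, here is a minimal keyword-based Hall classifier. The keyword lists are illustrative guesses, not MemPalace's actual rules; an LLM-based classifier would replace the keyword lookup with a prompt:

```python
# Illustrative keyword lists -- assumptions for this sketch, not the
# project's real classification vocabulary.
HALL_KEYWORDS = {
    "hall_facts": ["we use", "decided", "is set to"],
    "hall_events": ["yesterday", "meeting", "deployed", "milestone"],
    "hall_discoveries": ["turns out", "realized", "found that"],
    "hall_preferences": ["prefer", "like", "rather than"],
    "hall_advice": ["recommend", "should", "next time"],
}

def classify_hall(text, default="hall_facts"):
    # Return the first Hall whose keywords appear in the text.
    text = text.lower()
    for hall, keywords in HALL_KEYWORDS.items():
        if any(k in text for k in keywords):
            return hall
    return default

assert classify_hall("I prefer Terraform over Pulumi") == "hall_preferences"
assert classify_hall("turns out the pool was too small") == "hall_discoveries"
```

With only five target classes and clear semantic boundaries, even a shallow classifier like this has a fighting chance; with ten overlapping classes, it would not.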
Trade-offs
The fixedness of Halls is both their strength and their limitation.
Advantage: Because Hall types are predefined, all Wings have a consistent internal structure. This makes cross-Wing comparison possible --- you can compare wing_kai/hall_advice with wing_priya/hall_advice to see what advice different people gave on the same topic. If each Wing had a different internal classification scheme, such comparisons would be impossible.
Limitation: Five Halls may not cover all types of memory. For instance, "emotional memory" ("I felt very anxious about that decision") and "metacognitive memory" ("that debugging session made me realize my understanding of connection pools was wrong") may not fit neatly into any existing Hall. The DEFAULT_HALL_KEYWORDS in config.py actually defines a set of classification dimensions different from the five Halls (emotions, consciousness, memory, etc.), hinting that the system considered different classification schemes during its evolution.
Tier Three: Room --- Named Concept Nodes
What a Room Is
Room is the most important semantic unit in the palace. Each Room represents a named concept --- an idea specific enough, independent enough, and significant enough to deserve its own space.
Rooms use slug-format naming: hyphen-separated lowercase English strings. For example:
auth-migration
chromadb-setup
gpu-pricing
riley-college-apps
(mcp_server.py:113)
At the search layer, Room is another filter dimension parallel to Wing. When both Wing and Room are specified, searcher.py constructs a combined filter:
if wing and room:
where = {"$and": [{"wing": wing}, {"room": room}]}
(searcher.py:30-31)
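The two quoted branches can be composed into a small filter builder. The wing-only and room-only fallbacks below are inferred from the Wing filter shown earlier in this chapter; treat this as a sketch of the pattern, not the exact searcher.py code:

```python
def build_where(wing=None, room=None):
    # Combined filters use ChromaDB's $and operator; a single filter
    # is passed as a plain equality condition; no filter means a
    # palace-wide search.
    if wing and room:
        return {"$and": [{"wing": wing}, {"room": room}]}
    if wing:
        return {"wing": wing}
    if room:
        return {"room": room}
    return None

assert build_where(wing="wing_kai") == {"wing": "wing_kai"}
assert build_where(wing="wing_kai", room="auth-migration") == {
    "$and": [{"wing": "wing_kai"}, {"room": "auth-migration"}]
}
```

The resulting dictionary is what a ChromaDB query would receive as its where argument, so the same builder serves every combination of scoping the caller chooses.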
In the palace graph, Rooms are the graph's nodes. The build_graph() function in palace_graph.py iterates over all metadata in ChromaDB, constructing a node record for each Room:
room_data = defaultdict(lambda: {
"wings": set(), "halls": set(),
"count": 0, "dates": set()
})
(palace_graph.py:47)
Each Room node records which Wings it appears in, which Halls it belongs to, how many memories it contains, and the most recent date. When a Room appears in multiple Wings, it forms a Tunnel --- the subject of the next chapter.
Why Slugs
Room names use slug format (auth-migration rather than "Auth Migration" or auth_migration_2026) for three engineering reasons:
First, slugs are unambiguous identifiers. They contain no spaces, uppercase letters, special characters, or date suffixes. This means the same concept has exactly the same Room name string across different Wings --- auth-migration is auth-migration, and tunnel detection will not fail because one Wing wrote "Auth Migration" while another wrote "auth_migration."
Second, slugs are human-readable. Unlike hash values or numeric IDs, slugs carry semantic information. When you see the Room name gpu-pricing, you immediately know it is about GPU pricing. This is crucial for debugging, log analysis, and user interface display.
Third, slugs are composable. In ChromaDB's metadata system, slugs can be used directly as where filter condition values without any encoding or escaping. In the file system, slugs can serve directly as directory names. In URLs, slugs can appear directly in the path. This universality reduces friction when passing Room identifiers between different systems.
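A minimal slug normalizer illustrates the first property. This is a hypothetical helper, not MemPalace's actual normalization code, and it only enforces character-level consistency; the no-date-suffix discipline is a naming convention, not something this function checks:

```python
import re

def slugify(name):
    # Lowercase, collapse runs of non-alphanumerics into single
    # hyphens, and trim leading/trailing hyphens -- the slug
    # properties described in the text.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())
    return slug.strip("-")

# Variant spellings converge on one canonical Room name, so tunnel
# detection by exact string match keeps working.
assert slugify("Auth Migration") == "auth-migration"
assert slugify("auth_migration") == "auth-migration"
assert slugify("GPU Pricing!") == "gpu-pricing"
```

Because tunnel detection is exact string matching on Room names, this kind of normalization is what makes "same name means connected" reliable in practice.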
The Emergent Nature of Rooms
A key design decision is that Rooms are not predefined but emerge from the data. There is no "Room list" dictating which Rooms you may have. When MemPalace mines your conversations, it creates Rooms based on conversation content --- if you discussed a GraphQL migration, a graphql-switch Room appears; if you discussed Riley's college applications, a riley-college-apps Room appears.
This contrasts sharply with Hall's design: Halls are predefined, fixed, and globally consistent; Rooms are dynamic, data-driven, and can differ across Wings. This contrast reflects a deep design intuition: the types of memory are finite, but the content of memory is infinite. Halls encode type (what kind of thing you are remembering); Rooms encode content (what thing you are remembering). Types can be predefined; content must grow from the data.
Tier Four: Closet --- The Compression Gateway
What a Closet Is
Closet is an intermediate tier between Room and Drawer. It contains compressed summaries of original content --- enough for the AI to judge "does this Drawer contain the information I need," but without the full detail of the original text.
In the current implementation, Closet functionality is primarily realized implicitly through ChromaDB's embedding vectors and metadata system. Each memory's embedding vector is the "label on the closet door" --- the search engine decides whether to "open this closet" and examine its Drawers by comparing embedding vectors. The search_memories() function in searcher.py returns results containing both original text (Drawer content) and metadata (Closet information):
hits.append({
"text": doc, # Drawer: original text
"wing": meta.get("wing", "unknown"),
"room": meta.get("room", "unknown"),
"source_file": ..., # Closet: source information
"similarity": ..., # Closet: match score
})
(searcher.py:128-134)
The README mentions that AAAK compression will be introduced to the Closet tier in a future version: "In our next update, we'll add AAAK directly to the closets, which will be a real game changer --- the amount of info in the closets will be much bigger, but it will take up far less space and far less reading time for your agent."
Why an Intermediate Tier Is Needed
The existence of Closet solves a classic problem in information retrieval: you need to find a balance between "reading all content" and "reading only titles."
If the AI needs to read the full content of every Drawer in a Room to judge relevance, then when a Room contains hundreds of memories, token consumption becomes unacceptable. But if the AI judges based only on the Room name, precision is insufficient --- the auth-migration Room might contain entirely different content about migration reasons, migration process, bugs encountered during migration, and post-migration retrospectives.
Closet provides a mid-precision view: it tells the AI "roughly what is in this closet," letting the AI decide whether to open the Drawer and read the original text. Once AAAK compression is introduced, this intermediate tier will become even more efficient --- 30x compression means a Closet can encode an overview of a large amount of content in very few tokens.
Tier Five: Drawer --- The Original Truth
What a Drawer Is
Drawer is the lowest-level unit in MemPalace, storing original content. Each Drawer contains a piece of raw text --- a fragment of conversation, a chapter of a document, a code comment. MemPalace's core promise is: content in a Drawer is always original, verbatim, and unmodified by any summarization.
In the MCP server, the tool function for adding a Drawer explicitly requires content to be "verbatim":
"content": {
"type": "string",
"description": "Verbatim content to store "
"--- exact words, never summarized",
},
(mcp_server.py:626-629)
When adding a new Drawer, the system first performs a duplicate check --- if the content's similarity to an existing Drawer exceeds 90%, the addition is rejected:
dup = tool_check_duplicate(content, threshold=0.9)
if dup.get("is_duplicate"):
return {
"success": False,
"reason": "duplicate",
"matches": dup["matches"],
}
(mcp_server.py:259-265)
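The duplicate check's contract can be sketched with plain cosine similarity. The real system compares ChromaDB embedding vectors; this toy version uses raw 2-D vectors and a hypothetical check_duplicate() helper with the same threshold semantics as the quoted call:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def check_duplicate(new_vec, existing, threshold=0.9):
    # Flag the new memory when any stored vector is at least
    # `threshold` similar -- mirroring the 0.9 cutoff in the text.
    matches = [i for i, v in enumerate(existing)
               if cosine(new_vec, v) >= threshold]
    return {"is_duplicate": bool(matches), "matches": matches}

stored = [[1.0, 0.0], [0.0, 1.0]]
assert check_duplicate([0.99, 0.05], stored)["is_duplicate"] is True
assert check_duplicate([0.7, 0.7], stored)["is_duplicate"] is False
```

The rejected-write path then simply returns the matches so the agent can decide whether the near-duplicate adds anything new.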
Each Drawer's ID includes the Wing, Room, and content hash, ensuring uniqueness:
drawer_id = f"drawer_{wing}_{room}_" + hashlib.md5(
    (content[:100] + datetime.now().isoformat()).encode()
).hexdigest()[:16]
(mcp_server.py:267)
Why It Must Be Original
This is one of MemPalace's most central design decisions and the fundamental point of divergence from the vast majority of competing systems.
The typical practice of mainstream AI memory systems is to have the LLM extract "important information" at the storage stage. Mem0 uses an LLM to extract facts; Mastra uses GPT to observe conversations and generate structured records. These systems make an irreversible decision at the storage stage --- what is "important."
The problem is that "importance" is context-dependent. A detail that seems unimportant at storage time ("Kai mentioned he previously worked at a company that used Auth0") may become critical in a future query ("who has actual Auth0 experience?"). Once the storage stage discards this detail, it can never be retrieved.
MemPalace's stance is: store everything, and let the retrieval stage decide what is important. Drawers preserve original text, Closets provide rapid navigation, and Wing/Hall/Room provide structured filtering --- but the information itself is never modified or discarded.
The Five-Tier Structure in the MCP API
In the MCP server, the five-tier structure is exposed to AI agents through a set of tools:
mempalace_list_wings — list all Wings (tier one)
mempalace_list_rooms — list Rooms within a Wing (tier three)
mempalace_get_taxonomy — complete Wing -> Room -> count tree
mempalace_search — search, with optional Wing/Room filtering
mempalace_add_drawer — add original content to a specified Wing/Room
(mcp_server.py:441-637)
The tool_get_taxonomy() function builds the complete hierarchical view:
for m in all_meta:
w = m.get("wing", "unknown")
r = m.get("room", "unknown")
if w not in taxonomy:
taxonomy[w] = {}
taxonomy[w][r] = taxonomy[w].get(r, 0) + 1
(mcp_server.py:163-168)
The returned taxonomy object is a nested dictionary: {wing: {room: count}}. This allows an AI agent to understand the entire palace's structure in a single call, then decide which Wing and Room to search in.
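The same nesting can be reproduced in a few lines. This sketch mirrors the quoted loop using defaultdict; the input list is hypothetical:

```python
from collections import defaultdict

def build_taxonomy(all_meta):
    # Build the nested {wing: {room: count}} view, with "unknown"
    # as the fallback for missing metadata, as in the quoted loop.
    taxonomy = defaultdict(lambda: defaultdict(int))
    for m in all_meta:
        taxonomy[m.get("wing", "unknown")][m.get("room", "unknown")] += 1
    return {w: dict(rooms) for w, rooms in taxonomy.items()}

meta = [
    {"wing": "wing_kai", "room": "auth-migration"},
    {"wing": "wing_kai", "room": "auth-migration"},
    {"wing": "wing_kai", "room": "oauth-debugging"},
    {"wing": "wing_driftwood", "room": "auth-migration"},
]
assert build_taxonomy(meta) == {
    "wing_kai": {"auth-migration": 2, "oauth-debugging": 1},
    "wing_driftwood": {"auth-migration": 1},
}
```

An agent reading this structure sees at a glance where the memories are concentrated, which is exactly the information it needs to pick a Wing and Room before searching.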
Note a subtle design decision: the MCP API directly exposes Wings and Rooms but does not separately expose Halls. Halls exist as metadata within Drawers but are not an independent filter dimension. This reflects a pragmatic judgment: in current usage patterns, the Wing + Room combination already provides sufficient search precision (the +34-percentage-point R@10 gain discussed below), while Hall's additional filtering benefit is relatively small but would increase API complexity and classification error risk.
The Complete Design Picture
graph LR
subgraph "Search Path"
Q[Query] --> WF{Wing Filter}
WF --> RF{Room Filter}
RF --> VS[Vector Search]
VS --> CL[Closet Match]
CL --> DR[Drawer Returns Original Text]
end
subgraph "Precision Improvement"
A[Full search: 60.9%] --> B[+Wing: 73.1%]
B --> C[+Hall: 84.8%]
C --> D[+Room: 94.8%]
end
The five-tier structure's retrieval effect is cumulative --- each additional tier of structural constraint incrementally raises R@10 from 60.9% to 94.8%, a total of +34 percentage points (see Chapter 7 for the complete benchmark analysis). This is not because each tier performs "finer filtering" --- it is because each tier eliminates a specific class of interference. Wings eliminate cross-domain interference, Halls eliminate cross-type interference, and Rooms eliminate cross-concept interference.
These three types of interference manifest differently in vector space. Cross-domain interference is the strongest (auth discussions from two different projects are highly semantically similar); cross-type interference is moderate (facts about auth and advice about auth within the same project are semantically related but not identical); cross-concept interference is the weakest (auth discussions and billing discussions within the same project have lower semantic similarity). This explains why each tier's improvement diminishes: roughly 12 -> 12 -> 10 percentage points. The easier an interference source is to distinguish, the less structural constraint is needed to eliminate it.
Back to Ancient Greece
The five-tier structure's design can be understood as a high-fidelity translation of Simonides' Method of Loci.
In the Method of Loci, you first choose a building (Wing), then enter a floor or functional area (Hall), then arrive at a specific room (Room), then notice a particular object in the room (Closet), and finally retrieve the information you need from the object (Drawer).
MemPalace did not try to "improve" the Method of Loci --- it tried to translate the Method of Loci's structure into computable form as faithfully as possible. A Wing is not a "database partition" --- it is "a wing of a building." A Room is not a "folder" --- it is "a specific location you pass through as you walk through your mind." These names are not decorative metaphors --- they constrain design decisions.
When an engineer hears "database partition," they pursue uniform data distribution. When they hear "a wing of a building," they accept that different wings have different sizes --- because in a real building, the kitchen and the bedroom do not need to be the same size. This subtle difference in cognitive framing influences dozens of subsequent design choices, ultimately leading to a system significantly different from one driven by pure engineering considerations.
The next chapter will discuss the most distinctive emergent property of the five-tier structure --- Tunnels: the cross-domain connections that automatically form when the same Room appears in multiple Wings.
Chapter 6: Tunnels --- Cross-Domain Discovery
Positioning: The design and implementation of the Tunnel mechanism --- how a zero-cost graph construction strategy automatically surfaces cross-domain knowledge connections from ChromaDB's metadata.
Same Room, Different Wings
The previous chapter described how Wing, Hall, and Room progressively narrow the search space at the concept level. But these hierarchical structures have an inherent side effect: they tend to create silos. If all searches are confined to a single Wing, you can never discover the connection between "Kai's experience with the auth migration" and "the Driftwood project's auth migration decision" --- because they belong to different Wings.
MemPalace solves this problem with an extraordinarily simple mechanism: Tunnels.
The definition of a tunnel is a single sentence: when the same Room appears in two or more Wings, a tunnel is automatically formed between those Wings. No manual linking required, no additional index construction, no LLM reasoning. Same name means connected.
graph LR
subgraph wing_kai
R1[auth-migration]
end
subgraph wing_driftwood
R2[auth-migration]
end
subgraph wing_priya
R3[auth-migration]
end
R1 ---|tunnel| R2
R2 ---|tunnel| R3
R1 ---|tunnel| R3
style R1 fill:#4a9eff,color:#fff
style R2 fill:#4a9eff,color:#fff
style R3 fill:#4a9eff,color:#fff
In this example, the auth-migration Room appears in three Wings --- wing_kai (Kai's personal experience and work records), wing_driftwood (project-level decisions and progress), and wing_priya (Priya's approvals and recommendations as tech lead). Three Wings automatically form three tunnels through this shared Room.
The README example clearly illustrates the semantic meaning of these connections:
wing_kai / hall_events / auth-migration
-> "Kai debugged the OAuth token refresh"
wing_driftwood / hall_facts / auth-migration
-> "team decided to migrate auth to Clerk"
wing_priya / hall_advice / auth-migration
-> "Priya approved Clerk over Auth0"
Same topic (auth migration), three perspectives (implementer, project, decision-maker), three memory types (event, fact, advice). Tunnels connect these perspectives, allowing you to start from any one entry point and discover other related memories.
The Graph Is Built from Metadata
The tunnel implementation relies on the build_graph() function in palace_graph.py. This function is the core of the entire tunnel mechanism, and its design embodies a key engineering insight: no additional graph database is needed.
build_graph() works by iterating over all document metadata in ChromaDB, extracting Room, Wing, and whatever Hall metadata happens to exist, then constructing a graph in memory. The code:
room_data = defaultdict(lambda: {
"wings": set(), "halls": set(),
"count": 0, "dates": set()
})
(palace_graph.py:47)
For each memory record, the function extracts its metadata and updates the corresponding Room's node information:
for meta in batch["metadatas"]:
room = meta.get("room", "")
wing = meta.get("wing", "")
hall = meta.get("hall", "")
if room and room != "general" and wing:
room_data[room]["wings"].add(wing)
if hall:
room_data[room]["halls"].add(hall)
room_data[room]["count"] += 1
(palace_graph.py:52-63)
Note that room_data[room]["wings"] is a set. When the same Room is added from different Wings, this set naturally accumulates all Wings that Room spans. Tunnel detection is simply checking whether this set's size is greater than 1:
for room, data in room_data.items():
wings = sorted(data["wings"])
if len(wings) >= 2:
for i, wa in enumerate(wings):
for wb in wings[i + 1:]:
for hall in data["halls"]:
edges.append({
"room": room,
"wing_a": wa, "wing_b": wb,
"hall": hall,
"count": data["count"],
})
(palace_graph.py:70-84)
This code's logic warrants close examination. For each Room spanning two or more Wings, the function generates an edge record for every pairwise Wing combination and every Hall the Room belongs to. A Room appearing in 3 Wings with a single Hall generates 3 edges (A-B, A-C, B-C); with two Halls it would generate 6 edge records. Each edge also carries the Room's memory count --- this count is later used as a sorting weight during graph traversal.
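The edge-generation logic can be condensed into a self-contained sketch over toy metadata. This restates the quoted build_graph() fragments in one function; it is not the verbatim palace_graph.py code:

```python
from collections import defaultdict
from itertools import combinations

def detect_tunnels(metadatas):
    # Group memories by Room, then emit one edge per Wing pair per
    # Hall for every Room spanning two or more Wings.
    room_data = defaultdict(lambda: {"wings": set(), "halls": set(), "count": 0})
    for meta in metadatas:
        room, wing = meta.get("room", ""), meta.get("wing", "")
        hall = meta.get("hall", "")
        if room and room != "general" and wing:
            room_data[room]["wings"].add(wing)
            if hall:
                room_data[room]["halls"].add(hall)
            room_data[room]["count"] += 1
    edges = []
    for room, data in room_data.items():
        for wa, wb in combinations(sorted(data["wings"]), 2):
            for hall in sorted(data["halls"]):
                edges.append({"room": room, "wing_a": wa, "wing_b": wb,
                              "hall": hall, "count": data["count"]})
    return edges

metas = [
    {"wing": "wing_kai", "hall": "hall_events", "room": "auth-migration"},
    {"wing": "wing_driftwood", "hall": "hall_facts", "room": "auth-migration"},
    {"wing": "wing_kai", "hall": "hall_events", "room": "oauth-debugging"},
]
edges = detect_tunnels(metas)
# Only auth-migration spans two Wings, so only it produces tunnel edges.
assert {e["room"] for e in edges} == {"auth-migration"}
assert len(edges) == 2  # one Wing pair x two Halls
```

Nothing is written anywhere: the edges exist only in memory, derived entirely from metadata that the normal write path already stores.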
Zero additional storage cost. This is the most noteworthy aspect of the design. The graph is not stored in ChromaDB or in any external database. It is dynamically constructed from ChromaDB's metadata each time it is needed. The important caveat is that the current main write paths reliably populate wing and room, while hall is a richer optional metadata layer rather than a guaranteed field on every stored memory. So the graph always has a room-and-wing backbone, and Hall becomes an enhancement when that metadata is present.
The trade-off of this design is obvious: every query requires rebuilding the graph. At the scale of 22,000 memories, build_graph() needs to read all metadata in batches (1,000 per batch), meaning at least 22 ChromaDB calls. For real-time interactive scenarios, this may introduce perceptible latency. But MemPalace's choice is to accept this latency in exchange for zero additional storage and zero data consistency maintenance cost.
BFS Traversal: Starting from a Room
Knowing the graph exists is not enough --- you need to be able to walk through it. The traverse() function in palace_graph.py implements a breadth-first search (BFS) traversal, allowing you to start from a given Room and discover all reachable related Rooms.
def traverse(start_room, col=None, config=None,
max_hops=2):
(palace_graph.py:99)
The traversal logic is standard BFS, but the connection relationship is unique: two Rooms are connected if and only if they share at least one Wing.
frontier = [(start_room, 0)]
while frontier:
    current_room, depth = frontier.pop(0)
    if depth >= max_hops:
        continue
    current = nodes.get(current_room, {})  # node record for the current Room
    current_wings = set(current.get("wings", []))
    for room, data in nodes.items():
        if room in visited:
            continue
        shared_wings = current_wings & set(data["wings"])
        if shared_wings:
            visited.add(room)
            results.append({
                "room": room,
                "hop": depth + 1,
                "connected_via": sorted(shared_wings),
            })
            frontier.append((room, depth + 1))
(palace_graph.py:128-154)
The max_hops parameter (default 2) controls traversal depth. Setting it to 2 means you can discover "Rooms that directly share a Wing with the starting Room" (1 hop) and "Rooms that share yet another Wing with those Rooms" (2 hops). Within two hops, all semantically meaningful connections are typically covered; more distant connections tend to be too indirect to carry informational value.
Traversal results are sorted by (hop_distance, -count):
results.sort(key=lambda x: (x["hop"], -x["count"]))
return results[:50]
(palace_graph.py:157-158)
Connections with fewer hops are shown first; at equal hop counts, Rooms with higher occurrence counts are prioritized. Rooms with high occurrence counts are typically more important concept nodes --- they have accumulated more memory entries, meaning the topic has been discussed more frequently.
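Putting the traversal pieces together, here is a runnable restatement under toy data. Node records are reduced to wings and count, and the function follows the logic described above (BFS, shared-Wing connectivity, (hop, -count) sorting); it is a sketch, not the verbatim palace_graph.py implementation:

```python
def traverse(nodes, start_room, max_hops=2):
    # BFS over Rooms, where two Rooms are connected iff they share
    # at least one Wing.
    visited = {start_room}
    results = []
    frontier = [(start_room, 0)]
    while frontier:
        current_room, depth = frontier.pop(0)
        if depth >= max_hops:
            continue
        current_wings = set(nodes[current_room]["wings"])
        for room, data in nodes.items():
            if room in visited:
                continue
            shared = current_wings & set(data["wings"])
            if shared:
                visited.add(room)
                results.append({"room": room, "hop": depth + 1,
                                "count": data["count"],
                                "connected_via": sorted(shared)})
                frontier.append((room, depth + 1))
    # Fewer hops first; at equal hops, busier Rooms first.
    results.sort(key=lambda x: (x["hop"], -x["count"]))
    return results[:50]

nodes = {
    "auth-migration": {"wings": ["wing_kai", "wing_driftwood"], "count": 12},
    "oauth-debugging": {"wings": ["wing_kai"], "count": 4},
    "sprint-planning": {"wings": ["wing_driftwood", "wing_team"], "count": 7},
    "hiring": {"wings": ["wing_team"], "count": 3},
}
hops = traverse(nodes, "auth-migration")
assert [(r["room"], r["hop"]) for r in hops] == [
    ("sprint-planning", 1), ("oauth-debugging", 1), ("hiring", 2),
]
```

Note how hiring surfaces at hop 2: it shares no Wing with the start Room, but sprint-planning bridges wing_driftwood and wing_team, which is exactly the kind of indirect connection the traversal is for.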
Tunnel Discovery
Beyond graph traversal from a starting point, MemPalace also provides a dedicated tunnel discovery tool: find_tunnels().
def find_tunnels(wing_a=None, wing_b=None,
col=None, config=None):
(palace_graph.py:161)
This function's purpose is not navigation but discovery. It answers the question: "which topics connect these two domains?"
for room, data in nodes.items():
wings = data["wings"]
if len(wings) < 2:
continue
if wing_a and wing_a not in wings:
continue
if wing_b and wing_b not in wings:
continue
tunnels.append({
"room": room, "wings": wings,
"halls": data["halls"],
"count": data["count"],
})
tunnels.sort(key=lambda x: -x["count"])
(palace_graph.py:169-189)
You can specify no Wings (view all tunnels), one Wing (view all tunnels related to that Wing), or two Wings (view bridging topics between those two specific domains).
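The three calling modes can be exercised against toy nodes. This sketch mirrors the quoted filtering logic with a simplified node record; it is not the verbatim implementation:

```python
def find_tunnels(nodes, wing_a=None, wing_b=None):
    # Keep only multi-Wing Rooms, optionally require one or both
    # named Wings, and sort busiest tunnels first.
    tunnels = []
    for room, data in nodes.items():
        wings = data["wings"]
        if len(wings) < 2:
            continue
        if wing_a and wing_a not in wings:
            continue
        if wing_b and wing_b not in wings:
            continue
        tunnels.append({"room": room, "wings": wings, "count": data["count"]})
    tunnels.sort(key=lambda x: -x["count"])
    return tunnels

nodes = {
    "auth-migration": {"wings": ["wing_kai", "wing_driftwood", "wing_priya"],
                       "count": 12},
    "gpu-pricing": {"wings": ["wing_hardware", "wing_ai_research"], "count": 5},
    "oauth-debugging": {"wings": ["wing_kai"], "count": 4},
}
# No Wings: all tunnels. Two Wings: only the bridging topics.
assert [t["room"] for t in find_tunnels(nodes)] == ["auth-migration", "gpu-pricing"]
assert [t["room"] for t in find_tunnels(nodes, "wing_kai", "wing_priya")] == ["auth-migration"]
```

oauth-debugging never appears in the output: a Room confined to one Wing is, by definition, not a tunnel.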
In the MCP server, this functionality is exposed to AI agents through the mempalace_find_tunnels tool:
"mempalace_find_tunnels": {
"description": "Find rooms that bridge two wings "
"--- the hallways connecting different "
"domains.",
...
}
(mcp_server.py:571-581)
The tool description calls tunnels "the hallways connecting different domains." This wording reflects the essence of tunnels: they are not manually created indexes or links, but connections that naturally emerge when you discuss the same topic across different domains.
The Information-Theoretic Significance of Tunnels
The tunnel mechanism looks simple to the point of being trivial --- is it not just "same-name rooms automatically link"? But this simplicity masks a deep design insight.
In knowledge management systems, the most valuable information is typically found not within domains but at the intersection of domains. A search entirely within wing_code can tell you "how our auth module works" but cannot tell you "why the auth module was designed this way" --- because the design rationale might be recorded in wing_team's meeting notes or in wing_priya's technical recommendations.
Traditional knowledge management systems handle cross-domain connections in two ways:
Manual linking. Have users or administrators explicitly create cross-domain associations. This approach is precise but fragile --- it depends on human memory and diligence, and the cost of maintaining these links grows exponentially as data volume increases.
Global semantic search. Abandon domain partitioning and perform vector retrieval across the entire database. This approach requires no link maintenance but returns to the problem discussed in Chapter 4 --- high-dimensional degradation in large-scale vector spaces causes retrieval precision to decline.
MemPalace's tunnel mechanism is a third path: let structure automatically produce connections. You do not need to manually annotate "Kai's auth experience is related to Driftwood's auth decision" --- when you use the same Room name auth-migration in two different Wings, the association already exists. Room's slug format ensures naming consistency, and build_graph() ensures these consistencies are automatically detected.
This design rests on a reasonable assumption: if memories from two different domains are assigned the same Room name, a semantic association genuinely exists between them. This assumption holds in the vast majority of cases --- you will not accidentally use the same Room slug in two unrelated domains.
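The consistency this assumption depends on can be made concrete. A slug normalizer of roughly the following shape (illustrative; not MemPalace's actual naming code) is what guarantees that differently phrased mentions of one concept collapse onto one Room name:

```python
import re

def slugify(name: str) -> str:
    """Illustrative slug normalization: lowercase, replace every run of
    non-alphanumeric characters with a single hyphen, trim the edges."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())
    return slug.strip("-")

slugify("Auth Migration")    # -> "auth-migration"
slugify("auth_migration!!")  # -> "auth-migration"
slugify("Auth-Migration ")   # -> "auth-migration"
```

Because all three spellings land on the same slug, a Room created from any of them in a second Wing immediately forms a tunnel with the first.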
Graph Statistics: A Global View of the Palace
The graph_stats() function provides global statistics for the palace graph:
def graph_stats(col=None, config=None):
    nodes, edges = build_graph(col, config)
    tunnel_rooms = sum(
        1 for n in nodes.values()
        if len(n["wings"]) >= 2
    )
    return {
        "total_rooms": len(nodes),
        "tunnel_rooms": tunnel_rooms,
        "total_edges": len(edges),
        "rooms_per_wing": dict(wing_counts...),
        "top_tunnels": [...],
    }
(palace_graph.py:193-213)
This statistical view lets users and AI agents understand the palace's overall topology: how many Rooms exist, how many form tunnels, how dense the connections between Wings are, and which tunnels are most active.
The top_tunnels list is sorted by Wing count in descending order --- Rooms appearing in the most Wings are listed first. These highly connected Rooms typically represent the user's most central concerns --- topics that recur across multiple projects, multiple relationships, and multiple time periods.
An Honest List of Design Trade-offs
The tunnel mechanism's elegance should not obscure its limitations. Here are the trade-offs that must be honestly addressed:
Naming consistency dependency. Tunnel detection depends entirely on exact Room name matching. If one Wing uses auth-migration and another uses clerk-migration, even if they discuss the same thing, no tunnel will form. This places high demands on the automation of Room naming --- the naming function must be smart enough to ensure the same concept receives the same slug across different contexts.
Dynamic graph reconstruction cost. As mentioned above, every graph operation requires rebuilding the graph from ChromaDB. At large scales (tens of thousands of memories), this means dozens of database read operations. A possible improvement direction is introducing a graph caching layer --- but this would introduce cache consistency complexity.
Dense tunnel noise. If a Room appears in every Wing (e.g., an extremely generic concept like general-discussion), it forms too many tunnel connections, reducing the informational value of tunnels. build_graph() mitigates this by filtering out room == "general" (palace_graph.py:57), but for other high-frequency, low-information-value Rooms, there is currently no systematic filtering mechanism.
These limitations are all solvable engineering problems, not fundamental design flaws. The tunnel mechanism's core value --- surfacing cross-domain connections at zero cost from existing metadata --- is complete and unshaken by these implementation-level limitations.
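The missing systematic filter mentioned in the last point could take the shape of an IDF-style coverage cutoff: drop any Room that spans nearly every Wing, since such Rooms carry little cross-domain information. This is a hypothetical sketch, not anything in the current codebase:

```python
def filter_noisy_tunnels(nodes, total_wings, max_coverage=0.8):
    """Hypothetical IDF-style cutoff: drop rooms appearing in more than
    max_coverage of all wings; they form tunnels everywhere and thus
    distinguish nothing."""
    return {
        room: data
        for room, data in nodes.items()
        if len(data["wings"]) / total_wings <= max_coverage
    }

nodes = {
    "general-discussion": {"wings": ["w1", "w2", "w3", "w4", "w5"]},
    "auth-migration": {"wings": ["w1", "w2"]},
}
kept = filter_noisy_tunnels(nodes, total_wings=5)
# "general-discussion" spans 5/5 wings and is dropped;
# "auth-migration" (2/5) survives as a meaningful tunnel.
```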
The Value of Connection
The tunnel mechanism conceptually completes the transformation of the memory palace from "building" to "network."
In the structure described in the previous two chapters, the palace is a strict hierarchical tree: Palace -> Wing -> Hall -> Room -> Closet -> Drawer. Information is organized top-down, and search operates within a subtree. This structure is efficient and clear, but it is closed --- each subtree is an isolated island.
Tunnels break this closure. They add lateral edges to the tree, transforming it from a tree into a graph. You can no longer only move up and down within a single Wing --- you can cross from one Wing to another through tunnels, discovering connections invisible in a purely hierarchical structure.
This is why the palace_graph.py file header comment describes this module as "a navigable graph" --- not "a navigable tree." Trees are hierarchical, deterministic, and top-down; graphs are networked, emergent, and explorable from any starting node. MemPalace's five-tier structure provides the efficiency of a tree; the tunnel mechanism provides the discovery power of a graph.
The next chapter will use benchmark data to prove that the retrieval improvement from this "structure + connection" combination is not theoretical speculation --- it is a quantifiable, reproducible 34%.
Chapter 7: The 34% Retrieval Improvement Is Not a Coincidence
Positioning: Using data to prove that structure is the product --- where the 34% retrieval precision improvement comes from, why it is reproducible, and its broader implications for AI memory system design.
Four Numbers
On a benchmark test using over 22,000 real conversation memories, MemPalace recorded the following R@10 (Recall at 10 --- the probability that the correct answer appears in the top 10 results) data:
| Search Scope | R@10 | Improvement Over Baseline (pp) |
|---|---|---|
| Full search (no structure) | 60.9% | -- |
| Wing-scoped search | 73.1% | +12.2 |
| Wing + Hall | 84.8% | +23.9 |
| Wing + Room | 94.8% | +33.9 |
graph LR
A["Full search<br/>R@10: 60.9%"] --> B["Wing filter<br/>R@10: 73.1%<br/>+12%"]
B --> C["Wing+Hall<br/>R@10: 84.8%<br/>+24%"]
C --> D["Wing+Room<br/>R@10: 94.8%<br/>+34%"]
style D fill:#2d5,color:#fff
This set of data needs to be carefully interpreted.
60.9% is the baseline --- all memories placed in a flat vector database with no structural organization, using ChromaDB's default embedding model (all-MiniLM-L6-v2) for semantic retrieval. This baseline represents the performance ceiling of "pure vector search."
94.8% is the result after applying Wing + Room filtering. The same data, the same embedding model, the same retrieval algorithm --- the only variable is metadata filtering applied before search. From 60.9% to 94.8%, an improvement of 33.9 percentage points, entirely from structure. No better model, no larger embedding dimensions, no LLM reranking. Simply telling the search engine "look in this Wing's this Room."
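Mechanically, the intervention is nothing more than a metadata predicate applied before ranking. The sketch below imitates that "where clause, then vector search" pattern in plain Python --- the metadata keys mirror MemPalace's, and a stored distance stands in for real embedding similarity:

```python
def scoped_search(memories, score_fn, wing=None, hall=None, room=None, k=10):
    """Apply metadata filters first, then rank only the survivors ---
    the same 'filter before vector search' pattern as a where clause."""
    candidates = [
        m for m in memories
        if (wing is None or m["wing"] == wing)
        and (hall is None or m["hall"] == hall)
        and (room is None or m["room"] == room)
    ]
    return sorted(candidates, key=score_fn)[:k]

memories = [
    {"id": 1, "wing": "wing_driftwood", "hall": "hall_facts",
     "room": "auth-migration", "dist": 0.32},
    {"id": 2, "wing": "wing_orion", "hall": "hall_facts",
     "room": "auth-migration", "dist": 0.31},
    {"id": 3, "wing": "wing_driftwood", "hall": "hall_events",
     "room": "deploy-strategy", "dist": 0.45},
]

# The semantically closest memory (id 2) belongs to an unrelated wing;
# the wing filter removes it before ranking ever happens.
top = scoped_search(memories, lambda m: m["dist"], wing="wing_driftwood", k=1)
```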
The credibility of this data depends on three factors: data scale, test methodology, and reproducibility. 22,000 memories is not a small toy dataset --- it represents months of real usage accumulation. The test methodology follows standard information retrieval evaluation paradigms (R@K metrics). As for reproducibility, MemPalace provides complete benchmark run scripts in the benchmarks/ directory; anyone can repeat these tests on their own data.
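R@K itself is a simple statistic; a minimal implementation of the metric (field names invented for illustration) looks like this:

```python
def recall_at_k(results, gold, k=10):
    """Fraction of queries whose correct answer appears in the top-k.
    `results` maps query -> ranked list of memory ids;
    `gold` maps query -> the id of the correct answer."""
    hits = sum(
        1 for query, ranked in results.items()
        if gold[query] in ranked[:k]
    )
    return hits / len(results)

results = {"q1": [3, 7, 1], "q2": [4, 2, 9], "q3": [8, 5, 6]}
gold = {"q1": 1, "q2": 2, "q3": 99}
score = recall_at_k(results, gold, k=3)  # 2 of 3 queries hit -> 0.666...
```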
First-Tier Improvement: Wing Eliminates Cross-Domain Interference
From 60.9% to 73.1%, Wing produced a 12.2-percentage-point improvement. This is the largest single-tier improvement, and its cause is the most intuitive.
Consider a query: "Why did we choose Clerk for auth?"
In a full search, ChromaDB searches all 22,000 memories for the 10 most semantically similar to this query. The problem is that if you discussed auth-related topics in multiple projects --- say, Driftwood chose Clerk, Orion uses Auth0, and a personal project uses Firebase Auth --- the auth discussions from these three domains will be very close in vector space. Three sets of memories compete for top-10 positions, and the correct answer (Driftwood's Clerk decision record) may be pushed to position 11 or 12.
With Wing filtering added (wing="wing_driftwood"), the search space shrinks to Driftwood project memories. Two of the three sets of semantically similar auth discussions are directly excluded. The correct answer no longer needs to compete with results from unrelated domains.
This improvement is essentially leveraging prior knowledge. When the user or AI agent knows it is asking a question about Driftwood, this knowledge can be encoded as a Wing filter --- simplifying the search problem from "find the correct answer among 22,000 memories" to "find the correct answer among approximately 2,750 Driftwood memories."
The theoretical basis of vector search is that the correct answer should be closest to the query in embedding space. But when the candidate set contains many "close but incorrect" interference items, this theoretical assumption breaks down. Wing restores the validity of this assumption by removing the strongest interference source --- documents from entirely different domains that are semantically similar.
Second-Tier Improvement: Hall Distinguishes Memory Types
From 73.1% to 84.8%, Hall added another 11.7 percentage points on top of Wing.
This improvement is more subtle. Even within the same Wing, semantic overlap exists between different types of memories. "We decided to use Clerk" (fact), "Kai recommended Clerk because of better pricing and developer experience" (advice), and "the Clerk adoption was finalized at last Wednesday's meeting" (event) --- these three memories may be very close in vector space because they all contain keywords like "Clerk," "auth," and "decision."
But a query's intent usually points to only one type. If you ask "why did we choose Clerk," you want advice (hall_advice) or facts (hall_facts), not event records. If you ask "at which meeting was Clerk decided," you want events (hall_events), not technical advice.
Hall filtering eliminates this within-domain interference by distinguishing memory types. The five Halls (facts, events, discoveries, preferences, advice) correspond to five different query intent patterns. When the search system can correctly infer a query's type intent, it can further narrow the candidate set, excluding type-mismatched results.
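A query-to-Hall router could be as simple as surface-cue matching. The sketch below is hypothetical --- it is not MemPalace's actual inference logic --- but it illustrates how type intent becomes a filter:

```python
def infer_hall(query: str) -> str:
    """Hypothetical intent router: map surface cues in a query to the
    Hall most likely to hold the answer. Illustrative only."""
    q = query.lower()
    if q.startswith("why") or "recommend" in q:
        return "hall_advice"
    if "meeting" in q or q.startswith("when"):
        return "hall_events"
    if "prefer" in q or "like" in q:
        return "hall_preferences"
    return "hall_facts"

infer_hall("Why did we choose Clerk?")             # -> "hall_advice"
infer_hall("At which meeting was Clerk decided?")  # -> "hall_events"
infer_hall("What does Kai prefer for hosting?")    # -> "hall_preferences"
```

A real router would need to handle far messier queries, but even this crude mapping shows how one extra inferred label shrinks the candidate set by another factor.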
It is worth noting that Hall's improvement (11.7%) is nearly equal to Wing's improvement (12.2%). This means the strength of cross-type interference is roughly comparable to cross-domain interference --- a somewhat counterintuitive finding. You might expect interference from entirely different domains to be far stronger than interference from different types within the same domain, but distance distributions in vector space do not always match human intuition. In high-dimensional space, the distance differences between different types of text within the same domain can be as small as the distance differences between domains.
Third-Tier Improvement: Room Pinpoints Concepts
From 84.8% to 94.8%, Room produced the final 10 percentage points of improvement.
If Wing eliminates domain interference and Hall eliminates type interference, then Room eliminates concept interference. Even within the same Wing and the same Hall, multiple different concepts may exist. wing_driftwood/hall_facts might contain facts about auth migration, database selection, deployment strategy, and team organization. They are all "facts," all belong to "Driftwood," but they are about different things.
Room eliminates this last level of semantic ambiguity through named concept nodes (auth-migration, database-selection, deploy-strategy). When the search is scoped to wing_driftwood/room=auth-migration, the candidate set contains only memories about auth migration --- at this point, vector search merely needs to distinguish "the most relevant few" among a small number of highly relevant documents, which is a problem vector search handles well.
Room's 10% improvement, while the smallest of the three tiers, pushes retrieval precision from 84.8% to 94.8% --- crossing the 90% threshold generally regarded in engineering practice as the dividing line for "usable." From the user experience perspective, 84.8% means roughly one in every six searches fails to find the correct answer; 94.8% means roughly one in every twenty misses. This difference is perceptible in daily use.
Why Structure Works: High-Dimensional Degradation in Vector Spaces
The analysis above explains what each tier "does" but has not yet answered a more fundamental question: why can metadata filtering alone produce such a large improvement? Metadata filtering does not change embedding vector quality or distance calculation methods --- it only reduces the number of candidates participating in comparison. Why does reducing candidates improve precision?
The answer relates to a fundamental property of high-dimensional vector spaces: the curse of dimensionality.
MemPalace's default embedding model, all-MiniLM-L6-v2, generates 384-dimensional vectors. In 384-dimensional space, a repeatedly verified phenomenon is: as dataset size grows, the distance distribution among data points tends to concentrate --- the distance difference between the nearest and farthest neighbors becomes smaller and smaller.
In more intuitive terms: imagine standing at the center of a 384-dimensional space with 22,000 points around you. These points' distances from you might be distributed between 0.3 and 0.7. Now you need to find the 10 closest points. The problem is that within this distance range, hundreds of points might have distances falling between 0.31 and 0.35 --- the distance differences between them are smaller than measurement noise. At this precision, the distance difference between "ranked first" and "ranked fiftieth" might be only 0.01 --- far below any meaningful discrimination threshold.
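The concentration effect is easy to reproduce with synthetic data. The sketch below uses random Gaussian vectors as a stand-in for real embeddings and compares the relative contrast between nearest and farthest neighbors in 2 versus 384 dimensions:

```python
import math
import random

def relative_contrast(dim, n_points=2000, seed=7):
    """(farthest - nearest) / nearest distance from a random query to
    n_points random Gaussian vectors. Synthetic stand-in for embeddings."""
    rng = random.Random(seed)
    query = [rng.gauss(0, 1) for _ in range(dim)]
    dists = [
        math.dist(query, [rng.gauss(0, 1) for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(dim=2)     # nearest and farthest differ enormously
high = relative_contrast(dim=384)  # all 2000 points sit in a thin distance shell
# In 2 dimensions the contrast is large; in 384 dimensions it collapses,
# so distance alone barely separates the candidates.
```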
Now, if you shrink the candidate set from 22,000 to 2,750 through Wing filtering, the concentration of the distance distribution decreases. In a smaller candidate set, the distance gap between the correct answer and the nearest interference item widens. In information-theoretic terms, you have improved the signal-to-noise ratio --- not by strengthening the signal (better embeddings) but by reducing the noise (removing irrelevant candidates).
This is the value of structure: structure is not a better search algorithm; it is a better precondition for search.
Structure as Prior
Bayesian statistics has a core concept: the prior. Before observing data, you have an initial belief distribution about the problem's answer. The more informative the prior, the less data you need to arrive at an accurate posterior.
The Wing/Hall/Room structure plays exactly the role of a prior in retrieval.
Without structure, the search system's "prior" is a uniform distribution --- each of the 22,000 memories has an equal probability of being the correct answer. Embedding distance is the only source of evidence.
With structure, the search system's "prior" is dramatically updated --- after specifying a Wing, only about 1/8 of memories have a reasonable probability of being the correct answer; after further specifying a Room, perhaps only a few dozen memories are in the candidate range. Embedding distance is still the evidence source, but it now only needs to discriminate within a much smaller candidate set --- a far easier task.
The 34% improvement is essentially quantifying the value of prior information. When you tell the search system "the answer is in this Wing's this Room," you provide roughly 8 bits of prior information --- the log2 of the candidate-set reduction from 22,000 memories to a few dozen. This information comes not from a better model or more computation --- it comes from how the data is organized.
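The bit count is just the logarithm of the reduction ratio; for the figures in this chapter:

```python
import math

def prior_bits(n_before: int, n_after: int) -> float:
    """Bits of prior information gained by shrinking the candidate set."""
    return math.log2(n_before / n_after)

wing_bits = prior_bits(22000, 2750)  # Wing alone (1/8 of the data): 3.0 bits
room_bits = prior_bits(22000, 60)    # Wing + Room, a few dozen left: ~8.5 bits
```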
This also explains why the improvement is robust --- it does not depend on the choice of embedding model, the type of query, or the data domain. As long as the following conditions hold, structure will produce improvement:
- The dataset is large enough that full search faces high-dimensional degradation;
- Structural partitions are semantically coherent, so the correct answer likely falls in the correct partition;
- The semantic distance between partitions is greater than the semantic distance within partitions.
These three conditions hold in the vast majority of real-world AI memory scenarios.
Control Group: Systems Without Structure
To verify that structure is indeed the key variable, it is worth comparing MemPalace against systems that do not use structure on the same benchmark.
On the LoCoMo benchmark (1,986 multi-hop QA pairs), the comparison across different systems is as follows:
| System | Method | R@10 | Notes |
|---|---|---|---|
| MemPalace (session, no structure) | Pure vector search | 60.3% | Baseline |
| MemPalace (hybrid v5) | Vector + keyword + time weighting | 88.9% | Hybrid scoring |
| MemPalace (hybrid + Sonnet rerank) | Hybrid + LLM reranking | 100% | Perfect score in all categories |
The 60.3% baseline is nearly identical to the 60.9% mentioned above --- this is not a coincidence but validation of the same pattern: on memory sets at the ten-thousand scale, pure vector search R@10 hovers around 60%.
From 60.3% to 88.9% (hybrid v5), a 28.6-percentage-point improvement. This improvement comes from keyword overlap scoring, time weighting, and person-name boosting --- essentially introducing additional ranking signals beyond vector distance. These signals are not structural (they do not depend on Wing/Room partitioning) but heuristic.
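The shape of such a hybrid score can be sketched as follows. The weights and the exact signals here are illustrative stand-ins, not MemPalace's actual v5 formula:

```python
def hybrid_score(query_terms, memory, vector_sim, now,
                 w_kw=0.3, w_time=0.1):
    """Illustrative hybrid score: vector similarity plus keyword overlap
    plus a recency bonus. Weights are made up, not MemPalace's values."""
    terms = set(query_terms)
    mem_terms = set(memory["text"].lower().split())
    kw_overlap = len(terms & mem_terms) / max(len(terms), 1)
    # Recency decays with age measured in days (86400 seconds).
    recency = 1.0 / (1.0 + (now - memory["timestamp"]) / 86400.0)
    return vector_sim + w_kw * kw_overlap + w_time * recency

mem = {"text": "Kai recommended Clerk over Auth0", "timestamp": 1_700_000_000}
score = hybrid_score(["clerk", "auth0"], mem,
                     vector_sim=0.72, now=1_700_086_400)
```

The key property is that an exact-term match ("Clerk", "Auth0") can lift a memory above a merely semantically similar one --- precisely the failure mode pure embedding similarity exhibits on named entities.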
From baseline to 100% (with Sonnet rerank), a total improvement of 39.7 percentage points. LLM reranking contributed the final 11.1 percentage points from 88.9% to 100%.
Comparing these figures with the structural improvement:
- Pure structure (Wing + Room filtering): +34%
- Hybrid scoring heuristics: +28.6%
- LLM reranking: +11.1%
The structural improvement and hybrid heuristic improvement are of the same order of magnitude. But the cost difference between the two is enormous: the structural improvement has zero computational cost (merely adding a where clause), while hybrid heuristics require additional text processing, tokenization, and scoring computation. LLM reranking further demands API calls and additional latency.
Structure is the cheapest source of precision.
Reproducibility
In the benchmarking world, an unreproducible result is as good as nonexistent. MemPalace provides a complete reproduction path in the benchmarks/ directory.
Core benchmark scripts include:
- longmemeval_bench.py --- LongMemEval benchmark runner
- locomo_bench.py --- LoCoMo benchmark runner
- membench_bench.py --- MemBench benchmark runner
Each script accepts a dataset path and mode parameter and outputs results files in standard format. benchmarks/BENCHMARKS.md records the complete improvement journey from the 96.6% baseline to the 100% perfect score --- not as marketing material but as a technical experiment log, documenting in detail the motivation, method, and result of each improvement step.
For example, the first improvement from 96.6% to 97.8% (hybrid scoring v1) was motivated by identifying a specific failure mode: queries containing exact terms (such as "PostgreSQL" or "Dr. Chen"), where pure embedding similarity would rank semantically similar but term-mismatched documents above exact matches. The fix was layering keyword overlap weighting on top of embedding distance.
The final step from 99.4% to 100% involved analyzing three questions where two independent architectures (hybrid v4 and palace mode) both failed, then designing targeted fixes for each: quoted phrase extraction, person-name weighting, and memory/nostalgia pattern matching. This "analyze failure -> design fix -> verify effect" cycle is an engineering-reliable improvement method --- not parameter tuning, but understanding the cause of failure.
Structure Is the Product
This chapter's core argument can be captured in a single sentence: in AI memory systems, how data is organized matters more than the choice of retrieval algorithm.
The 34% improvement does not require a better embedding model --- all-MiniLM-L6-v2 is a compact model released in 2021 with a relatively small parameter count, far from the current state of the art in embedding technology. It does not require LLM involvement in this benchmark path --- no API calls were made in the structural-improvement experiment itself. It does not require complex post-processing --- no reranking, no query expansion, no pseudo-relevance feedback.
It requires only three things:
- Data was assigned meaningful metadata labels (wing, hall, room) at storage time;
- Search used these labels to narrow the candidate set;
- The label system is semantically coherent --- data within the same Wing is indeed semantically related, and data in different Wings is indeed semantically different.
None of these three things requires AI. What they require is a good classification design --- and this classification design comes from a cognitive technique that is twenty-five hundred years old.
MemPalace's README has a line worth repeating: "Wings and rooms aren't cosmetic. They're a 34% retrieval improvement. The palace structure is the product." This is not a marketing slogan --- it is a direct summary of the benchmark data.
When your data organization itself is your product, adding better algorithms is icing on the cake, not the foundational zero-to-one. The 34% improvement is the starting line that structure gives you. On top of this starting line, hybrid scoring adds another 28%, LLM reranking adds another 11%, and the final result reaches 100%. But without that starting line, you begin at 60% --- meaning you need algorithms and LLMs to fill a much larger gap, and those all have costs.
Structure is free. That is its significance.
Next Steps
This chapter and the preceding three collectively complete the "memory palace" portion of the argument: from the cognitive science foundation of the Method of Loci (Chapter 4), to the design and implementation of the five-tier structure (Chapter 5), to the cross-domain discovery capability of the tunnel mechanism (Chapter 6), to the validation through benchmark data (this chapter).
But the memory palace is only one of MemPalace's three core designs. Structure solves the "how to find information" problem, but there is another equally critical question left unanswered: once you find the information, how do you convey it to the AI within an extremely small token budget?
A Wing may contain thousands of memories. Even if structural filtering narrows the candidate set to a few dozen, stuffing all of those complete texts into the AI's context window is still expensive (potentially requiring thousands or even tens of thousands of tokens). You need a compression method --- not summary extraction that discards important detail, but an AI-readable compact representation whose ideal goal is to preserve factual structure.
This is the problem the AAAK dialect was created to solve. Part 3 will analyze in depth how this 30x compression, zero-information-loss, AI-specific language was designed.
Chapter 8: The Constraint Space of Compression
Positioning: This chapter opens Part 3, "The AAAK Compression Language." We temporarily leave the palace's spatial structure behind and turn to an entirely different problem: how to compress large volumes of textual information into an extremely small token space without losing anything. This chapter does not present AAAK's specific syntax (that is Chapter 9's job), but rather derives what a "feasible solution" must look like through constraint satisfaction analysis.
The Nature of the Compression Problem
The previous chapters discussed how the memory palace's spatial structure improves retrieval precision by 34%. But structure only solves the "where to find" problem, not the "how much can fit" problem.
Let us return to the core number from the preface: six months of daily AI use produces approximately 19.5 million tokens. Even after organizing data through the palace structure, when the AI needs to load baseline context at the start of a session -- "who is this person, what project are they working on, who are they working with" -- the data volume remains hopelessly large.
The intuitive framing of this problem is simple: can we compress a 1,000-token natural language description into 30-120 tokens while still preserving the factual content a language model would later need?
This seemingly impossible requirement is precisely what AAAK attempts to answer.
But before discussing how AAAK achieves this, it is necessary to rigorously define what "achieves" means. The most common error in engineering is not that the solution is inadequate, but that the problem was not properly defined. A vague goal produces infinitely many seemingly reasonable approaches, each of which ultimately fails along some unforeseen dimension.
Four Non-Negotiable Constraints
MemPalace's compression requirements can be precisely stated as four constraints. These are not "nice-to-have" preferences but "must all be satisfied" hard requirements. If any one is not met, the entire approach is unusable.
Constraint 1: 30x Compression Ratio
The 19.5 million tokens of context over six months, even after memory layer filtering (discussed in detail in Chapters 14-15), still requires the L1 layer -- the critical facts layer that the AI must load at every startup -- to accommodate a large amount of information.
The README presents AAAK as capable of roughly 30x compression. That number needs to be unpacked. Two things must be distinguished first: the design goal discussed in the README and this chapter, and the concrete current behavior of the open-source plain-text compressor in dialect.compress(). The former asks what kind of compression language could plausibly satisfy the constraints. The latter is a current implementation with obvious heuristic selection built in.
AAAK is really doing two different kinds of work:
- Structured-expression compression -- removing stopwords, abbreviating entity names, and using pipe-delimited structure to express the same batch of facts more compactly. In short structured examples, this can deliver roughly 5-10x compression.
- Heuristic selection -- extracting key sentences, selecting topics, and truncating entities / emotions / flags. This is the current plain-text compress() path: it chooses what seems worth retaining from longer text rather than performing a strict representation-preserving rewrite of every assertion.
For long, redundant conversation logs filled with phrases like "well, I think maybe we should consider...", the two effects combined can indeed reach the README's 30x number. But for already compact technical descriptions, the ratio is often closer to 5-10x.
So "30x" is an upper-bound figure for verbose conversation data, not the typical ratio for every text type. More importantly, the selection steps do not happen only in _extract_key_sentence(): topics, entities, emotions, and flags also involve top-k or heuristic pruning. In other words, the current repository's plain-text AAAK is closer to a high-compression index than to a strict lossless encoder, even though the original text remains preserved in Drawers.
That distinction is crucial for understanding AAAK: the README's lossless AAAK is a design target; the current open-source plain-text compressor is a lossy heuristic index; the 30x figure comes from structured expression and content selection acting together.
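The gap between verbose dialogue and structured expression is easy to see with a toy example. The compact line below is illustrative pipe-delimited structure, not actual AAAK syntax (Chapter 9 covers that), and character count serves as a crude proxy for tokens:

```python
verbose = (
    "Well, I think after a lot of back and forth we finally decided "
    "that Kai's recommendation made sense, so the team is going to "
    "migrate our authentication from Auth0 over to Clerk, mostly "
    "because of the pricing and the better developer experience."
)

# Illustrative compact encoding -- NOT real AAAK syntax:
compact = "decision|auth:Auth0->Clerk|by:Kai|why:pricing,DX"

ratio = len(verbose) / len(compact)
# Roughly 5x from structured expression alone on this filler-heavy
# sentence; heuristic selection over longer logs is what pushes
# verbose conversation data toward the 30x figure.
```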
Constraint 2: Factual Assertion Completeness
This is the fundamental dividing line between MemPalace and all "summary-based" memory systems.
When Mem0 or Zep's system extracts "user prefers Postgres" from your conversation, it discards the entire context of you spending two hours explaining why you migrated from MongoDB. When an LLM is asked to "summarize the key points of this conversation," it must make a judgment -- what counts as a key point and what does not -- and that judgment itself is an irreversible act of information discarding.
MemPalace's position is that the ideal AAAK design target should preserve factual assertions as completely as possible. The README uses "lossless" to describe that target, but it needs both a precise operational definition and an honest separation between that target and the current heuristic implementation.
In the AAAK context, "lossless" works best as a design-constraint definition: for every factual assertion in the original text -- who, did what, when, why, with what result -- an ideal AAAK encoding should contain a corresponding representation such that a competent language model can reconstruct that assertion correctly.
But this definition has an important boundary: it describes the ideal constraint, not a line-by-line factual description of today's dialect.compress(). The current plain-text implementation performs selection on key sentences, topics, entities, emotions, and flags. Assertions removed by those filters do not appear in the AAAK output. MemPalace's safety net is that the original text remains preserved in Drawers, so AAAK functions more like a high-compression index than like the only copy of the memory. What gets lost is not the underlying stored memory, but the coverage of the compressed index.
This definition also intentionally excludes style, rhetoric, and phrasing preference. MemPalace stores memories, not literary works.
Constraint 3: Readable by Any Text Model
The implications of this constraint are more stringent than they appear on the surface.
"Any text model" means Claude, GPT-4, Gemini, Llama, Mistral -- including any future large language model capable of processing natural language text. The compression format cannot depend on any specific model's training data, special tokens, or fine-tuned behaviors.
This constraint directly eliminates all vector embedding-based compression approaches. A vector generated by OpenAI's text-embedding-ada-002 is a meaningless sequence of floating-point numbers to Llama. Even different versions of models from the same company may have incompatible embedding spaces.
It equally eliminates all encoding schemes based on specific tokenizer behavior. Different models tokenize the same text differently -- BPE, SentencePiece, WordPiece each have their variations -- and any scheme that uses tokenization boundaries to encode information will break when switching models.
The deeper implication of this constraint is: the compression format must work at the "text" level, not at the "model internal representation" level. Regardless of which model reads the compressed text, it should be able to extract the information purely through language comprehension -- not through some special decoding capability.
Constraint 4: No Decoder or Special Tools Required
This is the final constraint, and the one most easily underestimated.
In traditional data compression, compression and decompression are paired operations. gzip needs gunzip, zstd needs zstd -d, and every compression format comes with a decoder. In traditional computing this is not a problem, because running a decoding program has negligible cost.
But in LLM memory systems, this assumption no longer holds. When the AI loads context at the start of a session, what it sees is text. There is no intermediate layer that can run a decoding program to convert a compressed format back to natural language before passing it to the model. The compressed text must be directly understood as model input -- without any preprocessing, decoding, or conversion.
More specifically, this constraint eliminates any approach that requires running additional code outside the model's inference. The model's context window is the only "runtime environment." The compression format must be self-explanatory within this environment.
Elimination Analysis: What Approaches Cannot Work
With the four constraints clearly defined, we can systematically eliminate approaches that appear viable but inevitably violate at least one constraint. This elimination analysis is not meant to disparage the eliminated approaches -- many of them are highly effective in other contexts -- but to narrow the search space of feasible solutions.
Approach A: Binary Encoding
The most intuitive high-compression approach is binary encoding. Encoding text information into some compact binary format can theoretically achieve compression ratios far exceeding 30x. Protocol Buffers, MessagePack, CBOR, and similar formats are widely used in inter-system communication, with compression efficiency far exceeding text.
Violates Constraint 3. No large language model has been trained to understand binary formats. When you place a protobuf-encoded byte stream into GPT-4's context window, the model sees gibberish. It cannot extract any information from it, much like a sighted reader of English confronted with a page of braille.
The root of the problem is that large language models' training corpora are text, and their world models are built on statistical learning of text patterns. Binary formats lie outside the coverage of these world models.
Approach B: JSON Compression
Since binary is out, what about organizing information as structured JSON? JSON is a text format, all models have seen large amounts of JSON training data, and understanding it is not a problem.
{
"team": {
"name": "Driftwood",
"lead": "Priya",
"members": [
{"name": "Kai", "role": "backend", "tenure": "3yr"},
{"name": "Soren", "role": "frontend"},
{"name": "Maya", "role": "infrastructure"},
{"name": "Leo", "role": "junior", "status": "new"}
]
},
"project": "saas_analytics",
"sprint": "auth_migration_to_clerk",
"decision": {
"by": "Kai",
"choice": "clerk_over_auth0",
"reasons": ["pricing", "developer_experience"]
}
}
Violates Constraint 1. Count the tokens: this JSON is approximately 180 tokens. The information it carries would require approximately 250 tokens in natural language. The compression ratio is less than 1.5x. JSON's syntactic overhead -- curly braces, square brackets, quotation marks, colons, commas, key names -- occupies substantial space, all of which are tokens the model must process but that carry no actual information.
You can optimize by shortening key names and removing indentation, but JSON's structural redundancy is inherent. In a pair like "name": "Kai", only the value carries new information; the key's meaning could often be inferred from context. Yet the JSON format requires every key to be written out.
The deeper problem is that JSON was designed for machine parsing; its redundancy is intentional -- for explicitness and fault tolerance. What we need is a format designed for LLM reading. LLMs have contextual inference capabilities and do not need that much explicit annotation.
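To make the overhead concrete, here is a small sketch using the chapter's rough 3-characters-per-token heuristic. The JSON payload and the shorthand string are illustrative constructions, not code from MemPalace:

```python
import json

# Hypothetical comparison: the same facts as JSON vs. pipe-delimited shorthand.
as_json = json.dumps({
    "team": {
        "name": "Driftwood",
        "lead": "Priya",
        "members": [{"name": "Kai", "role": "backend", "tenure": "3yr"}],
    }
})
as_shorthand = "TEAM: DRIFTWOOD | PRI(lead) | KAI(backend,3yr)"

def rough_tokens(text: str) -> int:
    # Chapter heuristic: roughly 3 characters per token for structured text.
    return len(text) // 3

print(rough_tokens(as_json), rough_tokens(as_shorthand))
```

Even with a single team member, the braces, quotes, and repeated key names make the JSON form more than twice the size of the shorthand.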
Approach C: LLM Summarization
Have a language model read the original text and output a concise summary. This is the approach adopted by most commercial AI memory systems.
Summary: Priya leads a team working on SaaS analytics.
Key members include backend developer Kai (3yr) and
junior developer Leo (new). Currently migrating auth to Clerk.
Violates Constraint 2. Summarization is lossy compression by definition. In the summary above, Soren and Maya have completely disappeared. The specific decision attribution -- "Kai recommended Clerk over Auth0 based on pricing and DX" -- is also lost.
You can instruct the model to "not omit any details," but this merely changes the information loss from explicit to implicit. The model still must judge what constitutes a "detail" -- and that judgment itself is a lossy operation. Moreover, the more thorough the summary, the lower the compression ratio, ultimately approaching verbatim reproduction of the original.
Another fatal problem with summaries is irreversibility. When you discover that a summary omitted a critical detail, you cannot recover it from the summary itself -- you must return to the original text. This means a summary cannot serve as the sole compressed representation; at best it can serve as an index. But MemPalace already has a better indexing system (the palace structure) and does not need another lossy index to supplement it.
Approach D: Custom Encoding Table
Design a custom encoding table: assign short codes to common concepts and use a lookup table for encoding and decoding. Similar to Morse code, but oriented toward semantics rather than letters.
For example: T1=Priya T2=Kai T3=Soren R1=backend R2=frontend P1=Driftwood, then use T1(lead,P1) T2(R1,3yr) T3(R2) to represent team structure.
Violates Constraint 4. The model reading this text does not know that T1 is Priya or that R1 is backend. It needs an encoding table -- a decoder -- to understand these codes. And the encoding table itself is text that must be loaded into the context, consuming additional tokens and further reducing the effective compression ratio.
More seriously, the encoding table is bound to a specific dataset. A different set of people, a different project, requires a new encoding table. This turns the system into a stateful component requiring maintenance, violating MemPalace's design philosophy of "simple enough that it cannot go wrong."
Of course, if the encoding is sufficiently intuitive -- such as using PRI for Priya and KAI for Kai -- then even without an explicit encoding table, the model can infer what entity each code corresponds to. But at that point, the "encoding" is no longer an arbitrary symbol mapping but rather an abbreviation system based on natural language intuition. This distinction is important because it points toward the direction of feasible solutions.
Approach E: Vector Embedding Compression
Encode text as vector embeddings, using 384-dimensional or 768-dimensional float arrays to "memorize" semantic information.
Violates both Constraint 3 and Constraint 4. As discussed earlier, embedding vectors are meaningless numbers to other models. Moreover, recovering original information from embeddings requires a decoder (or at least a similarity matching engine), which also violates Constraint 4.
Vector embeddings have their place in MemPalace -- ChromaDB uses them to power semantic search -- but they are a retrieval tool, not a storage format. This distinction is strictly maintained in the design.
The Shape of Feasible Solutions
After eliminating five approaches, the constraint space narrows dramatically. Let us reverse-engineer the characteristics a feasible solution must possess from the elimination results:
Must be a text format -- because Constraint 3 requires readability by any text model, and the common capability intersection of all text models is understanding text.
Must be self-explanatory -- because Constraint 4 requires no decoder; the compressed text itself must carry enough context for a model to understand its meaning.
Must preserve factual assertions as completely as possible -- because Constraint 2 is about factual completeness; every entity, relationship, attribute, and event should have a corresponding representation after compression, even if today's heuristic compressor does not always achieve that ideal in full.
Must be extremely compact -- because Constraint 1 requires a 30x compression ratio; each token must carry information density far exceeding natural English.
Putting these four characteristics together, a conclusion begins to emerge: the feasible solution must be some form of extremely abbreviated natural language.
Not a newly invented encoding -- because that would require a decoder. Not an entirely new syntax -- because models have not seen it in their training data. It must be English (or more precisely, an extremely condensed form of any natural language), leveraging the language comprehension capabilities that large language models already possess to "decode" the abbreviations.
This derivation has an important property: it is not reverse-engineered from AAAK's design but forward-derived from the constraints. Even if MemPalace's designer had never existed, any engineer facing the same four constraints, conducting the same elimination analysis, would arrive at the same conclusion -- the feasible solution must be extremely abbreviated English.
Information Redundancy in Natural Language
Since the feasible solution must be based on natural language abbreviation, a natural follow-up question is: how much redundancy does natural language actually contain that can be removed?
Information theory provides a quantitative answer. Claude Shannon estimated in his 1951 experiment that the information entropy of English is approximately 1.0-1.5 bits per character, while the maximum entropy of the English alphabet is approximately 4.7 bits per character. This means approximately 70% of English text is redundant -- it exists to help humans process language (grammatical markers, function words, morphological inflections), not to convey independent information.
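The ~70% figure falls out of a one-line calculation. Taking 1.4 bits/char, near the upper end of Shannon's 1.0-1.5 range:

```python
import math

# Back-of-envelope check of the ~70% redundancy figure. 1.4 bits/char sits
# near the upper end of Shannon's 1.0-1.5 estimate; the alphabet maximum is
# log2(26) ~ 4.7 bits per character.
H_english = 1.4
H_max = math.log2(26)
redundancy = 1 - H_english / H_max
print(f"{redundancy:.0%}")  # -> 70%
```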
But Shannon's estimate was at the character level. At the word level, sources of redundancy are more diverse:
Grammatical redundancy. In "The team is currently working on the project," "the," "is," "currently," "on," and the second "the" are all grammatical function words that carry no information about the team, the work, or the project. Remove them, and "team working project" still conveys the core semantics.
Rhetorical redundancy. In "Kai, who has been with the team for three years and has extensive experience in backend development, recommended Clerk," the phrases "who has been with the team for" and "and has extensive experience in" are rhetorical embellishments. The core information is "Kai(backend,3yr) rec:Clerk."
Explanatory redundancy. When you say "Priya manages the Driftwood team," "manages" implies that Priya is the leader. In a compressed representation, "PRI(lead)" suffices -- the additional semantics conveyed by the verb "manage" (supervision, decision authority, reporting relationships) can be inferred by the model from the "lead" role label alone.
Narrative redundancy. Natural language tends toward linear narration -- background first, then events, then conclusions. A compressed representation can break this order, directly listing facts in a structured manner, letting the model reconstruct narratives as needed.
These redundancies combined provide ample room for the 30x compression required by Constraint 1. But the key point is: removing redundancy is not the same as losing information. Redundancy is extra packaging around information; removing the packaging does not change the contents, provided the receiver has the ability to reconstruct complete understanding from the bare contents.
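The grammatical-redundancy claim is easy to demonstrate mechanically. A minimal sketch with a toy stopword list (the real list at dialect.py:155-289 is far larger):

```python
# Minimal demonstration of grammatical redundancy: strip function words
# with a toy stopword list.
STOPWORDS = {"the", "is", "currently", "on", "a", "an", "of", "and"}

def strip_function_words(sentence: str) -> str:
    words = sentence.lower().replace(",", "").split()
    return " ".join(w for w in words if w not in STOPWORDS)

print(strip_function_words("The team is currently working on the project"))
# -> team working project
```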
Large language models happen to possess this ability. Trained on trillions of words of text, they have deeply internalized English grammar rules, semantic relationships, and world knowledge. When they see "KAI(backend,3yr)," they do not need anyone to tell them this means "Kai is a backend developer with 3 years of experience" -- their language model automatically completes this inference.
This is the core insight of the feasible solution: the large language model itself is the decoder. No external decoding program is needed, because the model's language comprehension capability is the decoding capability. Constraint 4 appears to eliminate all approaches requiring a decoder, but in reality, it only eliminates approaches requiring an external decoder -- approaches that leverage the model itself as the decoder perfectly satisfy this constraint.
From Constraints to Design Space
To summarize this chapter's derivation chain:
- Problem definition: Compress large volumes of contextual information into an extremely small token space for instant LLM loading.
- Constraint definition: 30x compression, zero loss, model-universal, no decoder required. All four constraints are indispensable.
- Elimination analysis: Binary (model-unreadable), JSON (insufficient compression ratio), LLM summarization (lossy), custom encoding (requires decoder), vector embeddings (model-unreadable and requires decoder) -- all eliminated.
- Forward derivation: The feasible solution must be text-format, self-explanatory, preserve all facts, and be extremely compact. The intersection: extremely abbreviated natural language.
- Theoretical foundation: Natural language's 70% redundancy provides room for high-ratio compression, while the LLM's language comprehension capability serves as an implicit decoder.
The significance of this derivation chain is that it repositions AAAK from a seemingly arbitrary "invention" to the only feasible design direction under strict constraints. AAAK was not designed because it was "clever," but because under the given four constraints, the solution space had only one remaining corner.
Of course, "extremely abbreviated natural language" is still a large design space. What to abbreviate, what to preserve, what symbols to use for marking structure -- these specific syntactic decisions still have considerable degrees of freedom. The next chapter will dive into AAAK's specific syntax design, analyzing every choice it makes within this narrowed design space.
But before that, it is worth remembering one thing: AAAK works not because its syntax is particularly ingenious, but because it correctly identified the shape of the constraint space. Good engineering never starts from the solution -- it starts from the constraints.
Chapter 9: The Grammar Design of AAAK
Positioning: The previous chapter derived through constraint satisfaction analysis that a viable compression scheme must be "extremely abbreviated natural language." This chapter enters the concrete syntax level: what abbreviation rules AAAK uses, what delimiters, what tagging system, and how these choices are implemented in
dialect.py. From here on, we will see real code and real compression comparisons.
From English to AAAK: A Complete Comparison
Before analyzing the grammar rules, let us look at a complete comparison example. This is the core demonstration from the MemPalace README:
English original (~70 tokens):
Priya manages the Driftwood team: Kai (backend, 3 years),
Soren (frontend), Maya (infrastructure), and Leo (junior,
started last month). They're building a SaaS analytics platform.
Current sprint: auth migration to Clerk. Kai recommended Clerk
over Auth0 based on pricing and DX.
AAAK compressed (structured shorthand example):
TEAM: PRI(lead) | KAI(backend,3yr) SOR(frontend) MAY(infra) LEO(junior,new)
PROJ: DRIFTWOOD(saas.analytics) | SPRINT: auth.migration->clerk
DECISION: KAI.rec:clerk>auth0(pricing+dx) | ****
The original -- a natural language paragraph of approximately 250 characters -- is compressed into approximately 180 characters of AAAK representation. But the compression ratio at the token level is even more significant, because AAAK's structured format is more tokenizer-friendly -- the English original's articles, prepositions, and conjunctions each occupy individual tokens, while in AAAK all such redundancy has been removed.
More critical is the completeness of information. Let us verify item by item:
| Fact in English original | Corresponding representation in AAAK |
|---|---|
| Priya is the team lead | PRI(lead) |
| Kai does backend, 3 years of experience | KAI(backend,3yr) |
| Soren does frontend | SOR(frontend) |
| Maya does infrastructure | MAY(infra) |
| Leo is junior, just started | LEO(junior,new) |
| Team is called Driftwood | DRIFTWOOD |
| Building a SaaS analytics platform | saas.analytics |
| Current sprint is auth migration | SPRINT: auth.migration |
| Migration target is Clerk | ->clerk |
| Kai recommended Clerk | KAI.rec:clerk |
| Clerk preferred over Auth0 | clerk>auth0 |
| Reasons are pricing and developer experience | (pricing+dx) |
| This is an important decision | **** |
In this short structured example, thirteen factual assertions are preserved. The token count drops from roughly 70 in the original to roughly 35 after compression -- about a 2x ratio here. But that example should not be overgeneralized. In the current repository, dialect.compress() is a heuristic plain-text compressor: it selects key sentences, extracts top topics, and truncates entities / emotions / flags. For long, redundant conversation logs, structured shorthand plus content selection can drive the ratio much higher, which is where the README's 30x figure comes from.
Six Core Grammar Elements
AAAK's grammar can be decomposed into six core elements. Each corresponds to a specific compression strategy.
Element 1: Three-Letter Entity Encoding
This is AAAK's most fundamental grammar rule: personal names, project names, and other named entities are abbreviated to three uppercase letters.
Priya -> PRI Kai -> KAI Soren -> SOR
Maya -> MAY Leo -> LEO Driftwood -> DRI
The rule is simple: take the first three characters of the name and convert to uppercase. The implementation in dialect.py is as follows:
# dialect.py:378-379
def encode_entity(self, name: str) -> Optional[str]:
...
# Auto-code: first 3 chars uppercase
return name[:3].upper()
This implementation lives in the Dialect.encode_entity method (dialect.py:367-379). The method first checks for a predefined entity mapping (via the entities parameter passed to the constructor); if none exists, it falls back to the "take first three characters" auto-encoding strategy.
The choice of three letters is not arbitrary. Two letters (such as PR, KA) have too high a collision probability -- 26^2 = 676 combinations can easily produce ambiguity in a system with dozens of entities. Four letters (such as PRIY, KAIS) have diminishing returns -- the extra character's disambiguation benefit is not worth the token space it occupies at every occurrence. Three letters (26^3 = 17,576 combinations) strike the best balance between disambiguation and compactness.
More importantly, three-letter encoding maintains an intuitive association with the original name. When a model sees PRI, if Priya has appeared in the context, it can immediately make the connection. This is precisely the "self-explanatory" requirement discussed in Chapter 8: the encoding itself carries sufficient semantic cues without requiring an external encoding table.
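The rule and the code-space arithmetic can be checked in a few lines. Note that encode_entity here mirrors only the auto-encoding fallback, not the predefined-mapping lookup:

```python
# Auto-encoding fallback (mirrors dialect.py's "first 3 chars uppercase" rule)
# plus the code-space arithmetic behind the three-letter choice.
def encode_entity(name: str) -> str:
    return name[:3].upper()

for name in ("Priya", "Kai", "Driftwood"):
    print(name, "->", encode_entity(name))

# Code-space sizes for 2-, 3-, and 4-letter uppercase codes:
for n in (2, 3, 4):
    print(f"{n} letters: {26 ** n:,} combinations")
```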
Element 2: Pipe Delimiter
AAAK uses the vertical bar | as a field separator, replacing commas, periods, and line breaks in natural language.
0:PRI+KAI|backend_auth|"switched to Clerk"|determ+convict|DECISION
This structure is built in dialect.py's compress method (dialect.py:539-602). The method takes detected entities, topics, key quotes, emotions, and tags as separate fields, joined by pipe characters:
# dialect.py:600-602
parts = [f"0:{entity_str}", topic_str]
if quote_part:
parts.append(quote_part)
...
lines.append("|".join(parts))
The choice of pipe character has two engineering rationales. First, it rarely appears in natural language, so it does not create ambiguity with the content itself -- unlike the comma, which serves as both delimiter and English punctuation. Second, large language models have seen extensive pipe-delimited formats in training data (command-line output, Markdown tables, log files) and have already learned to interpret | as "field boundary."
Element 3: Arrow Causality
AAAK uses -> to represent causal, directional, or transitional relationships:
auth.migration->clerk # migration direction
fear->trust->peace # emotional arc
KAI.rec:clerk>auth0 # recommendation (Clerk preferred over Auth0)
The semantics of arrows vary slightly across contexts: in action contexts they indicate direction (from A to B), and in emotional arcs they indicate temporal progression (first fear, then trust, finally peace). The related > sign marks preference in comparisons, as in clerk>auth0. But the core meaning is always "left-to-right flow" -- a metaphor understood by virtually all cultures and all language models.
Emotional arcs are marked with the ARC: prefix in dialect.py (dialect.py:742), allowing the model to directly read an individual's emotional trajectory from ARC:fear->trust->peace.
Element 4: Star Importance Markers
AAAK uses one to five stars to mark the importance level of information:
DECISION: KAI.rec:clerk>auth0(pricing+dx) | ****
The elegance of this marking system lies in its cognitive transparency. Anyone (and any model) seeing four stars immediately knows "this is important," without needing an explanation of what "importance level 4" means. Stars are defined in the MCP server's AAAK specification (mcp_server.py:109):
IMPORTANCE: * to ***** (1-5 scale).
In dialect.py's zettel encoding path, importance is expressed through an emotional_weight value (0.0-1.0) (dialect.py:697), while at the AAAK specification level, stars provide a more intuitive alternative representation.
Element 5: Emotion Tags
AAAK uses short emotion codes to tag the emotional tone of text. dialect.py defines a complete emotion encoding table (dialect.py:47-88):
# dialect.py:47-52 (excerpt)
EMOTION_CODES = {
"vulnerability": "vul",
"joy": "joy",
"fear": "fear",
"trust": "trust",
"grief": "grief",
"wonder": "wonder",
...
}
The encoding rule takes the first three to four characters of the emotion word as an abbreviation: vulnerability becomes vul, tenderness becomes tender, exhaustion becomes exhaust. The encode_emotions method (dialect.py:381-388) converts an emotion list into a compact +-joined string, keeping at most three emotion tags:
# dialect.py:381-388
def encode_emotions(self, emotions: List[str]) -> str:
codes = []
for e in emotions:
code = EMOTION_CODES.get(e, e[:4])
if code not in codes:
codes.append(code)
return "+".join(codes[:3])
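A standalone version of the same logic shows the output shape. EMOTION_CODES here is a tiny illustrative subset of the full table at dialect.py:47-88:

```python
# Standalone sketch of the encode_emotions logic.
EMOTION_CODES = {"vulnerability": "vul", "joy": "joy", "fear": "fear"}

def encode_emotions(emotions):
    codes = []
    for e in emotions:
        code = EMOTION_CODES.get(e, e[:4])  # fall back to first 4 chars
        if code not in codes:
            codes.append(code)
    return "+".join(codes[:3])  # cap at three tags

print(encode_emotions(["vulnerability", "joy", "fear", "grief"]))
# -> vul+joy+fear ("grief" is dropped by the three-tag cap)
```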
Emotion tags might seem superfluous in an AI memory system -- why should memory record emotions? But MemPalace's designer clearly recognized that the decision-making context of humans often includes an emotional dimension. "We chose Clerk under extreme anxiety" and "We calmly chose Clerk after thorough deliberation" convey not just emotional differences but also signals about decision quality. When the model is later asked "Was that decision sound?", emotion tags provide additional basis for judgment.
In the MCP server, the AAAK specification uses a more expressive emotion tagging syntax (mcp_server.py:107):
EMOTIONS: *action markers* before/during text.
*warm*=joy, *fierce*=determined, *raw*=vulnerable, *bloom*=tenderness.
These asterisk-wrapped action markers are closer to "stage directions" in literary writing -- they do not merely annotate "there is sadness here" but rather "the tone here is vulnerable." This provides tone cues for the AI when recalling past events.
Element 6: Semantic Flags
AAAK defines a set of fixed flags to mark the type and nature of information (dialect.py:29-36):
ORIGIN = origin moment (the birth of something)
CORE = core belief or identity pillar
SENSITIVE = requires extremely careful handling
PIVOT = emotional turning point
GENESIS = directly led to something that currently exists
DECISION = an explicit decision or choice
TECHNICAL = technical architecture or implementation detail
The _FLAG_SIGNALS dictionary (dialect.py:117-152) defines mapping rules from natural language keywords to flags:
# dialect.py:117-125 (excerpt)
_FLAG_SIGNALS = {
"decided": "DECISION",
"chose": "DECISION",
"switched": "DECISION",
"founded": "ORIGIN",
"created": "ORIGIN",
"turning point": "PIVOT",
"core": "CORE",
...
}
When the compression engine detects keywords like "decided," "chose," or "switched" in the text, it automatically adds the DECISION flag. These flags function like index tags in a database -- they do not change the content itself but greatly accelerate subsequent retrieval and filtering. When the AI is asked "What important decisions have we made?", it only needs to filter for the DECISION flag rather than performing semantic matching across all memories.
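A minimal detector in the same spirit (the real _FLAG_SIGNALS table is larger, and dialect.py's detection has more machinery around it):

```python
# Keyword-to-flag detection, using a small excerpt of the signal table.
_FLAG_SIGNALS = {
    "decided": "DECISION",
    "chose": "DECISION",
    "switched": "DECISION",
    "founded": "ORIGIN",
    "created": "ORIGIN",
    "turning point": "PIVOT",
}

def detect_flags(text: str, limit: int = 3) -> list:
    lower = text.lower()
    flags = []
    for keyword, flag in _FLAG_SIGNALS.items():
        if keyword in lower and flag not in flags:
            flags.append(flag)
    return flags[:limit]

print(detect_flags("We decided to switch auth, a real turning point."))
# -> ['DECISION', 'PIVOT']
```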
The Compression Pipeline: From Raw Text to AAAK
Having understood the six grammar elements, let us see how they are assembled in the Dialect.compress method. This is MemPalace's core entry point for converting arbitrary text to AAAK (dialect.py:539-602).
The compression pipeline consists of five stages:
Stage 1: Entity detection. The _detect_entities_in_text method (dialect.py:510-537) searches the text for known entities (via predefined mappings) or automatically detects capitalized words as potential entity names. Detected entities are encoded as three-letter codes joined by +.
Stage 2: Topic extraction. The _extract_topics method (dialect.py:430-455) extracts key topic words from the text. Its strategy is word frequency counting, with weighted treatment of capitalized words (potentially proper nouns) and words containing underscores or hyphens (potentially technical terms). A stopword list (dialect.py:155-289) ensures that "the," "is," "was," and other uninformative words do not pollute topic extraction results.
Stage 3: Key sentence extraction. The _extract_key_sentence method (dialect.py:457-508) selects the most "important" sentence fragment from the text as a quote. The scoring criteria favor short sentences containing decision words ("decided," "because," "instead") -- these sentences typically carry the highest information density.
Stage 4: Emotion and flag detection. _detect_emotions (dialect.py:408-417) and _detect_flags (dialect.py:419-428) respectively detect emotional tone and semantic flags through keyword matching.
Stage 5: Assembly. All detected components are assembled into pipe-delimited AAAK format lines:
# dialect.py:596-602
parts = [f"0:{entity_str}", topic_str]
if quote_part:
parts.append(quote_part)
if emotion_str:
parts.append(emotion_str)
if flag_str:
parts.append(flag_str)
lines.append("|".join(parts))
If metadata is present (source file, wing, room, date), a header line is prepended before the content line (dialect.py:583-589).
graph TD
A[Raw Text] --> B[Entity Detection & Encoding]
B --> C[Topic Extraction]
C --> D[Key Sentence Selection]
D --> E[Emotion/Flag Detection]
E --> F[AAAK Assembly Output]
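The five stages can be sketched end to end with toy stand-ins for each detector. Nothing here is dialect.py's actual code -- each stage is deliberately oversimplified to show the data flow:

```python
def toy_compress(text: str) -> str:
    """Toy five-stage pipeline; each stage is a drastically simplified
    stand-in for the corresponding dialect.py method."""
    words = [w.strip(".,") for w in text.split()]
    # Stage 1: entity detection -- capitalized words become 3-letter codes.
    entities = []
    for w in words:
        if w[:1].isupper():
            code = w[:3].upper()
            if code not in entities:
                entities.append(code)
    # Stage 2: topic extraction -- underscore/hyphen words as technical terms.
    topics = [w.lower() for w in words if "_" in w or "-" in w]
    # Stage 3: key sentence -- first sentence containing a decision word.
    quote = next((s.strip() for s in text.split(".")
                  if any(k in s.lower() for k in ("decided", "because"))), "")
    # Stage 4: flag detection via keyword match.
    flags = ["DECISION"] if "decided" in text.lower() else []
    # Stage 5: assembly into a pipe-delimited AAAK-style line.
    parts = ["0:" + "+".join(entities[:3]), ".".join(topics[:3])]
    if quote:
        parts.append(f'"{quote}"')
    if flags:
        parts.append("+".join(flags))
    return "|".join(parts)

print(toy_compress("Kai decided to use auth_migration because Clerk is cheaper."))
```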
The design philosophy of the entire pipeline is: better to over-preserve than under-preserve. Each stage has upper limits (at most 3 emotions, at most 3 flags, at most 3 topics), but no lower limits -- if fewer items are detected than the upper limit, all are preserved.
Calculating the Compression Ratio
The Dialect class provides a compression_stats method (dialect.py:936-946) to quantify compression effectiveness:
# dialect.py:932-934
@staticmethod
def count_tokens(text: str) -> int:
"""Rough token count (1 token ~ 3 chars for structured text)."""
return len(text) // 3
This token count uses the approximation "every 3 characters equals roughly 1 token" -- a reasonable estimate for structured text (natural English is approximately 4 characters per token, but AAAK's uppercase letters and symbols make tokenization denser).
In practice, the compression ratio depends on the nature of the original text. Purely narrative conversation logs (filled with expressions like "well, I think that maybe we should consider...") can achieve compression ratios above 30x. Structured technical descriptions (already fairly compact) typically achieve 5-10x compression. The 30x figure claimed in the MemPalace README is representative of typical conversation logs.
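Applying the count_tokens heuristic to the short README excerpt from earlier in this chapter gives a rough ratio for already-dense text:

```python
# Rough ratio for the short README excerpt, using the chapter's heuristic.
def count_tokens(text: str) -> int:
    return len(text) // 3

original = ("Priya manages the Driftwood team: Kai (backend, 3 years), "
            "Soren (frontend), Maya (infrastructure), and Leo (junior, "
            "started last month).")
compressed = ("TEAM: PRI(lead) | KAI(backend,3yr) SOR(frontend) "
              "MAY(infra) LEO(junior,new)")

ratio = count_tokens(original) / count_tokens(compressed)
print(round(ratio, 1))  # roughly 1.8 -- short, dense text compresses least
```

Verbose conversation logs, by contrast, carry far more removable redundancy, which is where the much higher ratios come from.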
Zettel Encoding: A Compression Path for Structured Data
Beyond plain text compression, dialect.py also maintains an encoding path for structured zettel data. This is AAAK's original design -- encoding memory entries that have already been decomposed into zettel (card) format.
The encode_zettel method (dialect.py:681-710) processes a single zettel card:
# dialect.py:681-685
def encode_zettel(self, zettel: dict) -> str:
zid = zettel["id"].split("-")[-1]
entity_codes = [self.encode_entity(p) for p in zettel.get("people", [])]
entity_codes = [e for e in entity_codes if e is not None]
...
The output format follows the specification defined in dialect.py's header comments (dialect.py:15-18):
Header: FILE_NUM|PRIMARY_ENTITY|DATE|TITLE
Zettel: ZID:ENTITIES|topic_keywords|"key_quote"|WEIGHT|EMOTIONS|FLAGS
Tunnel: T:ZID<->ZID|label
Arc: ARC:emotion->emotion->emotion
The encode_file method (dialect.py:720-751) encodes a complete zettel JSON file (containing multiple zettels and their tunnel connections) into a multi-line AAAK text block. The header line contains the file number, primary entity, date, and title, followed by the encoded line for each zettel and tunnel connection lines.
These two paths -- plain text compression and zettel encoding -- serve different use cases. Plain text compression (the compress method) is used for real-time processing of new input, while zettel encoding (encode_zettel / encode_file) is used for processing pre-processed and structured historical data.
The Delivery Mechanism for the AAAK Specification
No matter how elegant a grammar design is, it is useless if the model does not know its rules. AAAK solves this problem in a surprisingly direct way: embedding the complete specification text in the MCP server's status response.
mcp_server.py defines an AAAK_SPEC constant (mcp_server.py:102-119):
# mcp_server.py:102-103
AAAK_SPEC = """AAAK is a compressed memory dialect
that MemPalace uses for efficient storage.
It is designed to be readable by both humans
and LLMs without decoding.
...
"""
This specification is embedded in the mempalace_status tool's response (mcp_server.py:85-86):
# mcp_server.py:84-86
return {
...
"protocol": PALACE_PROTOCOL,
"aaak_dialect": AAAK_SPEC,
}
This means that when the AI explicitly calls mempalace_status, and a palace collection already exists, it receives the complete AAAK grammar specification in the response. From that point on, it knows how to read and write AAAK. The precondition matters: if the palace has not been initialized yet, status returns _no_palace() guidance instead of the AAAK spec payload.
The brilliance of this design lies in the fact that the specification itself is also natural language text. The model does not need to "learn" a new encoding -- it simply reads a description about this encoding, just as a human reads a format specification document. The AAAK specification can describe itself using AAAK's own terminology, a recursive self-consistency.
In testing, Claude, GPT-4, Gemini, and other models correctly read and generate AAAK text after seeing the AAAK specification for the first time. No fine-tuning, no few-shot examples, no iterative training required. This validates Chapter 8's core argument: AAAK is not a new language but an extremely abbreviated form of English, and models' existing language capabilities are sufficient to "decode" it.
A Model's First Contact
To illustrate more concretely this property of "readable at first sight," consider the following scenario:
A model that has never seen AAAK receives this text:
TEAM: PRI(lead) | KAI(backend,3yr) SOR(frontend) MAY(infra) LEO(junior,new)
Even without any specification, the model can infer: this describes a team; PRI is an abbreviation of someone's name, likely the team lead; KAI does backend with 3 years of experience; SOR does frontend; and so on. Because these abbreviations and structures leverage universal patterns of English -- parentheses contain attributes, commas separate attributes, uppercase abbreviations are names.
And when the model simultaneously receives the AAAK specification, understanding becomes even more certain: it no longer needs to "guess" that PRI is a name abbreviation, because the specification explicitly states "ENTITIES: 3-letter uppercase codes."
This property of "roughly understandable even without the specification, precisely understandable with it" is the key achievement of AAAK's design. It makes the compression format "work" on two levels: at the language intuition level (leveraging the model's language comprehension), and at the specification level (eliminating ambiguity through explicit format description).
Automatic Detection of Emotional Signals
A noteworthy design in dialect.py is the automatic emotion detection mechanism. The _EMOTION_SIGNALS dictionary (dialect.py:91-114) maps everyday English emotional keywords to AAAK emotion codes:
# dialect.py:91-99 (excerpt)
_EMOTION_SIGNALS = {
"decided": "determ",
"prefer": "convict",
"worried": "anx",
"excited": "excite",
"frustrated": "frust",
"love": "love",
"hope": "hope",
...
}
This means when you write "I'm worried about the deadline," the compression engine automatically detects "worried" and tags anx (anxiety). You do not need to manually annotate emotions -- the system infers them from the text itself.
Similarly, _FLAG_SIGNALS (dialect.py:117-152) automatically adds semantic flags through keyword detection. "decided to use GraphQL" triggers DECISION, "this was a turning point" triggers PIVOT, "I created a new repo" triggers ORIGIN.
This keyword-based detection is obviously not perfect -- it misses euphemisms ("I'm not sure this is the right approach" will not trigger doubt), and may misjudge ("I love this bug" where "love" is clearly sarcastic). But in MemPalace's design philosophy, coarse-grained automatic detection is better than no detection. Even if only 60% of emotional signals are tagged, that 60% still provides a valuable filtering dimension for subsequent retrieval.
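A minimal sketch of this kind of detection (the dictionaries below are abbreviated excerpts, and `detect_signals` is a hypothetical helper, not the dialect.py implementation -- which scans a larger signal table):

```python
# Sketch of keyword-based signal detection. The dictionaries are
# abbreviated excerpts; detect_signals is a hypothetical helper.
_EMOTION_SIGNALS = {
    "decided": "determ",
    "worried": "anx",
    "excited": "excite",
    "love": "love",
}
_FLAG_SIGNALS = {
    "decided": "DECISION",
    "turning point": "PIVOT",
    "created a new repo": "ORIGIN",
}

def detect_signals(text: str):
    """Return (emotion codes, flags) found by plain substring matching."""
    lowered = text.lower()
    emotions = [code for kw, code in _EMOTION_SIGNALS.items() if kw in lowered]
    flags = [flag for kw, flag in _FLAG_SIGNALS.items() if kw in lowered]
    return emotions, flags

print(detect_signals("I'm worried about the deadline"))  # tags "anx"
# Substring matching also explains the false positives discussed above:
print(detect_signals("I love this bug"))  # tags "love", missing the sarcasm
```

Because detection is pure substring matching, both the strength (zero cost, no model call) and the weaknesses (missed euphemisms, sarcastic false positives) fall directly out of the mechanism.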
Layer 1 Generation: Global Compression
The most complex method in dialect.py is generate_layer1 (dialect.py:784-902), which extracts the most critical memories from all zettel files and generates a compressed "Layer 1" wake-up file.
The logic of this method has three steps:
- Filtering: Iterate through all zettels, keeping only entries with emotional weight above a threshold (default 0.85) or entries carrying ORIGIN, CORE, or GENESIS flags.
- Grouping: Group filtered entries by date, generating =MOMENTS[date]= sections.
- Encoding: Apply AAAK encoding to each entry, outputting compact pipe-delimited lines.
Output example:
## LAYER 1 -- ESSENTIAL STORY
## Auto-generated from zettel files. Updated 2026-04-07.
=MOMENTS[2025-11]=
PRI+KAI|auth decision|"chose Clerk for DX and pricing"|0.92|DECISION
KAI|backend architecture|"GraphQL over REST"|0.88|DECISION+TECHNICAL
=MOMENTS[2025-12]=
LEO|onboarding|"first PR merged"|0.85|ORIGIN
=TUNNELS=
auth decision connects KAI and PRI
More accurately, this is one kind of Layer 1 artifact that the AAAK toolchain can generate -- a way to condense months of critical history into fewer than 120 tokens. It must be distinguished from the current default runtime path: the public repository's mempalace wake-up still goes through layers.py and typically outputs roughly 600-900 tokens of L0 + L1 text, rather than directly loading generate_layer1() output.
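The filter/group/encode logic can be sketched as follows (hypothetical in-memory zettel dicts and a simplified encoder; generate_layer1() itself reads zettel files and does considerably more):

```python
from collections import defaultdict

# Hypothetical zettel records; the real generate_layer1() reads zettel files.
zettels = [
    {"date": "2025-11", "who": "PRI+KAI", "topic": "auth decision",
     "quote": "chose Clerk for DX and pricing", "weight": 0.92, "flags": ["DECISION"]},
    {"date": "2025-11", "who": "SOR", "topic": "css refactor",
     "quote": "minor cleanup", "weight": 0.40, "flags": []},
    {"date": "2025-12", "who": "LEO", "topic": "onboarding",
     "quote": "first PR merged", "weight": 0.85, "flags": ["ORIGIN"]},
]

KEEP_FLAGS = {"ORIGIN", "CORE", "GENESIS"}

def layer1(entries, threshold=0.85):
    # Step 1: filter by emotional weight or essential flags.
    kept = [z for z in entries
            if z["weight"] >= threshold or KEEP_FLAGS & set(z["flags"])]
    # Step 2: group by date.
    by_date = defaultdict(list)
    for z in kept:
        by_date[z["date"]].append(z)
    # Step 3: encode each entry as a compact pipe-delimited line.
    lines = []
    for date in sorted(by_date):
        lines.append(f"=MOMENTS[{date}]=")
        for z in by_date[date]:
            flags = "+".join(z["flags"])
            lines.append(f'{z["who"]}|{z["topic"]}|"{z["quote"]}"|{z["weight"]}|{flags}')
    return "\n".join(lines)

print(layer1(zettels))
```

The low-weight, unflagged "css refactor" entry is dropped at step 1; everything that survives is grouped and emitted in the pipe-delimited shape shown in the output example.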
A Summary of the Grammar Design
Looking back at AAAK's six core grammar elements, a pattern becomes clear: every element follows the same principle -- leverage the model's existing language intuitions, rather than inventing new rules the model must learn.
- Three-letter encoding leverages the intuition that "uppercase abbreviation = name"
- Pipe delimiting leverages the intuition that "vertical bar = field boundary"
- Arrows leverage the intuition that "left to right = cause/direction"
- Stars leverage the intuition that "more stars = more important"
- Emotion codes leverage the intuition that "abbreviation = original word"
- Flags leverage the intuition that "uppercase word = label"
Not a single element requires the model to "learn" new semantics. They are all built on patterns that appear extensively in model training data. This is not coincidence but rather a direct consequence of the constraints derived in Chapter 8: when you cannot use a decoder, your only "decoder" is the model's existing knowledge. And the most reliable way to leverage existing knowledge is to use patterns it already understands.
The next chapter will explore a far-reaching consequence of this design choice: because AAAK is merely abbreviated English rather than some model-specific encoding, it inherently possesses cross-model universality.
Chapter 10: Cross-Model Universality
Positioning: The previous two chapters derived AAAK's design constraints and specific grammar, respectively. This chapter addresses a deeper question: why can AAAK be understood by any text model? This is not an accidental side effect but a necessary result explainable from a linguistic perspective. And what product implications does this technical property carry -- the decoupling of the memory system from model vendors.
An Unexpected Discovery
In early 2026, when MemPalace's benchmark results were published, the tech community's first reaction focused on the scores themselves -- a perfect score on LongMemEval, 96.6% with zero API calls. But a question that subsequently emerged may deserve more attention: are these results tied to Claude? If you switched to GPT-4, Gemini, or a completely offline Llama model, would AAAK still work?
MemPalace's README gave an answer so concise it bordered on provocative:
Works with: Claude, ChatGPT, Gemini, Llama, Mistral -- any model that reads text.
"Any model that reads text." This is not marketing hyperbole but a technical assertion that can be rigorously argued from design principles. Understanding why it holds requires returning to the fundamental level of linguistics.
The Linguistic Explanation: Omission Rather Than Invention
AAAK's fundamental difference from other AI-specific formats is that it invents nothing new.
Consider some formats that require specific models or specific training to understand:
- Vector embeddings: A 768-dimensional vector generated by Model A is a meaningless sequence of numbers to Model B. Because embedding spaces are products of model training, there is no correspondence between different models' embedding spaces.
- Special tokens: Some systems use custom tokens (such as <memory> or [FACT]) to mark information types. These tokens are only effective in models trained to recognize them.
- Function call formats: OpenAI's function call syntax (JSON-structured function_call) can only be correctly processed by models trained to support this format.
The common thread among these formats is that they add a new layer of convention on top of natural language. The receiver must "know" this convention in advance to correctly interpret the content.
AAAK takes the opposite direction. It does not add new rules on top of English but removes redundancy from English. Let us analyze this process through a concrete example:
First layer of removal -- function words:
Original: Priya manages the Driftwood team
First layer: Priya manages Driftwood team
Removed: the
The article "the" functions here to mark "Driftwood team" as definite. But when "Driftwood" is itself a proper noun, this marker is entirely redundant -- there is no "other Driftwood team" to distinguish.
Second layer of removal -- verb morphology:
First layer: Priya manages Driftwood team
Second layer: Priya lead Driftwood
Removed: manages (replaced with role label lead), team (inferable from context)
The verb "manages" tells us Priya holds a leadership role. Directly substituting the role label lead preserves the information with fewer tokens. "Team" can be inferred from AAAK's TEAM: prefix.
Third layer of removal -- full names:
Second layer: Priya lead Driftwood
Third layer: PRI(lead) DRIFTWOOD
Removed: ya (the trailing characters of the name)
The first three letters "PRI" of "Priya" are sufficient to uniquely identify this person in the given context. The subsequent "ya" carries no additional distinguishing information.
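The three removal layers can be illustrated with toy rules hard-coded for this one sentence (a sketch only -- the real compressor must generalize these rules, and the function names here are invented):

```python
# Toy illustration of the three removal layers. Rules are hard-coded
# for this single sentence; the real compressor generalizes them.
def strip_function_words(s: str) -> str:
    """Layer 1: drop articles that mark definiteness a proper noun already implies."""
    drop = {"the", "a", "an"}
    return " ".join(w for w in s.split() if w.lower() not in drop)

def apply_role_label(s: str) -> str:
    """Layer 2: replace the verb with a role label; 'team' is inferable from context."""
    return s.replace("manages", "lead").replace(" team", "")

def abbreviate_entities(s: str) -> str:
    """Layer 3: first three letters, uppercased; role moves into parentheses."""
    words = s.split()
    words[0] = words[0][:3].upper() + "(" + words[1] + ")"
    del words[1]
    words[-1] = words[-1].upper()
    return " ".join(words)

s = "Priya manages the Driftwood team"
s = strip_function_words(s)   # "Priya manages Driftwood team"
s = apply_role_label(s)       # "Priya lead Driftwood"
s = abbreviate_entities(s)    # "PRI(lead) DRIFTWOOD"
print(s)
```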
Every layer removes components of English that "help humans process but carry no independent information." Not a single step involves inventing new grammar rules or symbol systems. The final result PRI(lead) DRIFTWOOD is still -- from a linguistic standpoint -- English. An extremely condensed, grammatically stripped-down English, but English nonetheless.
This is the fundamental reason for AAAK's cross-model universality: any model that can read English can read AAAK, because AAAK is English.
Linguistic Boundaries of Omission
Of course, not all omissions are safe. Linguistics draws a critical distinction: recoverable ellipsis and unrecoverable ellipsis.
Recoverable ellipsis means the receiver can reconstruct the omitted component from context. "Who's coming?" "Me." -- here "am coming" is omitted from "I am coming," but the listener can recover it without difficulty.
Unrecoverable ellipsis means the omitted information cannot be inferred from context. If the original text is "Kai has been working with Priya for three years on backend systems, and he recommended Clerk because of its superior developer experience and competitive pricing," compressing it to "KAI rec Clerk" is unrecoverable ellipsis -- the three-year working relationship and the specific reasons for the recommendation are lost.
AAAK's grammar design strictly limits itself to the domain of recoverable ellipsis. KAI(backend,3yr) omits "is a," "developer," "with," and "of experience" from "Kai is a backend developer with 3 years of experience," but all these words can be automatically recovered by the model from the structure of (backend,3yr). KAI.rec:clerk>auth0(pricing+dx) omits substantial vocabulary from "Kai recommended Clerk over Auth0 based on pricing and developer experience," but every omitted component can be inferred from the preserved structure: .rec = recommended, : = object separator, > = over/preferred to, (pricing+dx) = based on pricing and developer experience.
And unrecoverable information -- such as Kai and Priya having worked together for three years -- is not omitted; it is preserved in the form of 3yr.
This principle of "only performing recoverable ellipsis" corresponds directly to Chapter 8's Constraint 2: factual completeness as a design ideal. Stated in linguistic terms, that means omissions are acceptable only when they remain recoverable from structure.
Guaranteeing Recoverability: Who Does the Recovering?
Recoverable ellipsis has a critical prerequisite: there must be a "recoverer" with sufficient knowledge to reconstruct the omitted components. In human conversation, this recoverer is the listener, relying on shared linguistic and world knowledge. In AAAK's scenario, this recoverer is the large language model.
The question then becomes: do different models possess the same recovery ability?
The answer is: for the types of omission AAAK uses -- yes.
The recovery abilities AAAK depends on can be divided into three categories:
Grammatical recovery -- recovering "Kai is a backend developer" from KAI(backend). This requires the model to understand the pattern "name followed by parentheses indicates attributes." All major large language models have seen extensive instances of this pattern in training data (function calls in programming languages, parameter descriptions in technical documentation, attribute listings in encyclopedia entries), making this a universal text comprehension capability.
Lexical recovery -- recovering "recommended" from rec, "3 years" from 3yr. This requires the model to understand common English abbreviation patterns. Again, all models have seen extensive abbreviations in training data (msg = message, yr = year, info = information), making this a fundamental language capability.
Structural recovery -- recovering "there is a team, PRI is the lead, KAI does backend" from TEAM: PRI(lead) | KAI(backend). This requires the model to understand pipe delimiting and hierarchical structure. As discussed earlier, pipe delimiting appears extensively in command-line output, Markdown tables, and similar formats, and all models possess this understanding.
These three categories of recovery ability share a common characteristic: they do not depend on any specific model's proprietary training. They are universal capabilities learned from general text corpora. Any model trained on Common Crawl, Wikipedia, GitHub, and other standard corpora -- that is, any mainstream large language model -- possesses these capabilities.
This is why MemPalace can confidently claim AAAK "works with any model that reads text." Not because compatibility testing was performed on every model, but because the capabilities it depends on represent the greatest common denominator of all models.
Product Implications: Decoupling
Technical properties do not create value in themselves -- value comes from the product possibilities that technical properties enable. The most important possibility enabled by AAAK's cross-model universality is: complete decoupling of the memory system from model vendors.
What Decoupling Means
In the current AI application ecosystem, memory is typically deeply bound to the model. Using OpenAI's memory feature means your memories are stored on OpenAI's servers, accessible only through OpenAI's API. Switching to Claude means starting from zero -- your history, your preferences, the context you accumulated over six months, all reset.
This binding is a form of implicit lock-in. You might want to switch models because a competitor is cheaper, faster, or better at a particular task, but the switching cost is too high -- not the technical cost, but the knowledge cost. Everything your AI assistant "knows" about you is locked inside one vendor's walled garden.
AAAK breaks this binding. Because AAAK text can be understood by any model, your memories become a portable asset. Use Claude today, GPT-4 tomorrow, a local Llama the day after -- as long as you provide the same AAAK text to the new model, it can recover the same compact representation of your context. The README uses roughly 120 tokens to describe that compressed context size; that is AAAK's target advantage as a format, not the measured default output of today's wake-up path.
This is not merely a theoretical possibility. MemPalace's architecture already supports the underlying portability claim: ChromaDB is stored locally, AAAK itself is plain text, and MCP tools plus CLI search are not bound to any specific model. More precisely, the current repository has already proven "AAAK as portable cross-model text," while the default wake-up entry point still mainly emits raw-text L0 + L1; once AAAK is fully wired into the wake-up path, switching friction falls even further.
A Fully Offline Memory Stack
AAAK's cross-model universality, combined with MemPalace's local architecture, produces an even more radical possibility: once local dependencies and default embedding assets are prepared, the entire memory stack can run completely offline.
Consider this tech stack:
- Storage: ChromaDB, running on the local machine, no network required
- Model: Llama 3 or Mistral, running locally via llama.cpp or Ollama
- Compression: AAAK, pure Python implementation, zero external dependencies
- Search: ChromaDB's built-in vector search, local embedding model
Once those dependencies are in place, ingestion, storage, search, and context loading no longer require internet connectivity: no required API calls, no required cloud services, no data forced to leave your machine.
This property holds extremely high value for specific user groups. Enterprises handling sensitive data (legal, medical, financial) cannot send memory data to third-party servers. Security researchers do not want their work records appearing in anyone's training data. Developers working in environments with unreliable networks need a memory system that does not depend on network connectivity.
For these scenarios, "offline" is not a degraded option -- it is a hard requirement. AAAK makes satisfying this requirement possible -- not by sacrificing functionality, but through an architecture that by design needs no online components.
No Vendor Lock-in
Combining the previous two points, AAAK achieves operation genuinely free of vendor lock-in:
No lock-in at the data layer. Memories are stored in local ChromaDB; you have complete control over the data. Export is a simple file copy operation.
No lock-in at the format layer. AAAK is a plain text format that depends on no proprietary encoding or decoding tools. You can open AAAK files with any text editor; while not ideally suited for human reading, they are fully readable.
No lock-in at the model layer. AAAK text can be understood by any model. Switching from Claude to GPT requires no data conversion or format migration.
No lock-in at the runtime layer. The MCP server is a standard JSON-RPC interface. CLI tools are standalone command-line programs. The Python API is an open-source library. You can run the entire stack in any Python-capable environment.
This full-stack absence of lock-in is uncommon. Most AI memory products have binding at least at one level -- Mem0 is bound to its cloud service, Zep is bound to Neo4j and its API, OpenAI Memory is bound to OpenAI's ecosystem. MemPalace is replaceable at every level, making it a true infrastructure component rather than a service with lock-in.
Boundaries of Universality
In the interest of honesty, AAAK's cross-model universality has its boundaries.
Capability floor. AAAK relies on models possessing basic English comprehension and abbreviation recovery abilities. An extremely small language model (e.g., one with fewer than 1 billion parameters) may not be able to reliably perform these inferences. In practice, models with 7B parameters or more can typically understand AAAK reliably, but smaller models may experience degradation.
Cultural assumptions. AAAK's abbreviation rules are primarily based on English. While structures like PRI(lead) are understandable to any model that has seen English text, if the original content is in Chinese or Japanese, three-letter entity encoding (based on the first three characters of English names) does not apply well. MemPalace mitigates this through configurable entity mappings (Dialect.from_config), but AAAK's core syntax assumptions remain English-first.
Compression granularity. AAAK's automatic compression is based on keyword detection and word frequency analysis -- heuristic methods that perform less well on certain text types. Highly technical text (filled with code snippets and mathematical formulas) or highly emotional text (more metaphors and allusions than direct statements) may not be suitable for automatic compression and may require manual adjustment or selective original text preservation.
These boundaries are real, but they do not change the core argument: for mainstream large language models and English-dominant use cases, AAAK's cross-model universality is a reliable technical property supported by linguistic theory.
A Broader Insight
This chapter's analysis reveals an insight that extends beyond MemPalace itself: data format design for AI should leverage models' existing capabilities rather than require models to acquire new capabilities.
This principle sounds obvious, but it is frequently violated in practice. Every time a system designer invents a new markup syntax, customizes a set of token protocols, or requires a model to output in a specific format, they are asking the model to do something it may not have been trained to do. Some models can learn it, some cannot. Some versions handle it; upgrades may not.
AAAK's strategy is the opposite: demand nothing new from the model; simply present information in a form the model already knows how to process. This gives AAAK natural robustness against model upgrades, switches, and version differences -- because it depends not on a model's specific behavior but on its general capabilities.
This design principle can be generalized: if you are designing a system that needs to interact with multiple LLMs, your format choice should favor "a subset of natural language" over "a superset of natural language." Omission is safer than addition. Abbreviation is more reliable than invention. Leverage the model's existing knowledge rather than test its learning limits.
MemPalace demonstrated this principle with AAAK: a "language" that invented no new grammar became the most universally compatible AI memory format. Not because it was complex enough to cover all needs, but because it was simple enough to be understood by all models.
This simplicity is not the starting point of the design but the endpoint of constraint satisfaction. Chapter 8's four constraints filtered out all complex approaches like a funnel, and what remained was this -- an extremely abbreviated English, understandable by any model that reads text, requiring no decoder, no training, no special anything.
Sometimes, the best design is what remains after removing everything unnecessary.
Chapter 11: Temporal Knowledge Graph
Positioning: The opening chapter of Part 4, "The Time Dimension." Starting from MemPalace's knowledge graph source code, this chapter analyzes the design philosophy of temporal triples: facts are not eternal -- they have lifecycles. This chapter is the foundation for understanding Chapter 12's contradiction detection and Chapter 13's timeline narration.
Facts Expire
Before discussing the technical implementation, let us address the most fundamental cognitive issue.
"Kai is working on the Orion project." This statement was a fact in June 2025. By March 2026, Kai had moved to another project, and the statement was no longer true. But in a traditional knowledge graph, this record still sits quietly, with no one informing the system that it has expired. The next time someone asks "What project is Kai working on now?", the system confidently delivers a wrong answer.
This is not hypothetical. This is a real problem faced by virtually all knowledge systems based on static triples. In Wikipedia's infoboxes, tens of thousands of expired facts await manual updates by human volunteers every day. In enterprise knowledge bases, project assignments, personnel responsibilities, and technology stack choices -- these pieces of information typically have a shelf life measured in weeks, while update frequency is measured in months or even years.
MemPalace's response to this problem is: do not pretend facts are eternal. Give every fact an explicit time window.
This is the core idea of the Temporal Knowledge Graph.
Static KG vs. Temporal KG
A traditional knowledge graph stores triples: subject-predicate-object. For example, (Kai, works_on, Orion). This triple expresses a relationship, but it does not answer three critical questions:
- When did this relationship begin?
- Is this relationship still valid now?
- At a specific historical point in time, was this relationship valid?
A static KG cannot answer these questions because its data model simply has no time dimension. All you can do is overwrite old values (losing history) or append new values (creating contradictions).
A temporal KG adds two timestamps to triples: valid_from (effective date) and valid_to (expiration date). Multiple relationships of the same type can exist between the same subject and object, each covering a different time period. This transforms the knowledge graph from a static snapshot into a chronicle.
A comparison table:
| Capability | Static KG | Temporal KG |
|---|---|---|
| Store current facts | Yes | Yes |
| Store historical facts | Overwrite loses them | Fully preserved |
| Answer "Is X true now?" | Yes (but may be stale) | Precisely |
| Answer "Was X true in Jan 2025?" | No | Yes |
| Detect expired information | No | Yes |
| Support timeline narration | No | Yes |
MemPalace chose temporal KG. The direct consequence of this choice is: every fact entering the knowledge graph must carry temporal information, and every query result leaving the knowledge graph can be filtered by time.
Schema Design
Opening knowledge_graph.py, the complete database structure can be seen from the _init_db() method (knowledge_graph.py:55). MemPalace's temporal KG is built on two SQLite tables.
entities table
CREATE TABLE IF NOT EXISTS entities (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
type TEXT DEFAULT 'unknown',
properties TEXT DEFAULT '{}',
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
(knowledge_graph.py:58-63)
The entity table design is minimalist. id is the normalized form of the entity name (lowercase, spaces replaced by underscores), name preserves the original display name, type marks the entity type (person, project, tool, concept), and properties is a JSON field for additional entity attributes (such as birthday or gender).
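The normalization rule can be sketched as a standalone function (the name entity_id is hypothetical; the repository exposes this as the private method _entity_id):

```python
def entity_id(name: str) -> str:
    """Sketch of the id normalization described above:
    lowercase, spaces replaced by underscores."""
    return name.lower().replace(" ", "_")

print(entity_id("Orion Project"))  # orion_project
```

Because the id is derived deterministically from the display name, the same entity always maps to the same row no matter how many triples mention it.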
Note that the type field defaults to 'unknown'. This means entities can be created without complete type information -- the system will not refuse to store a relationship because of missing metadata. This is a classic "tolerant input" design: store the information first, fill in type information later.
triples table
CREATE TABLE IF NOT EXISTS triples (
id TEXT PRIMARY KEY,
subject TEXT NOT NULL,
predicate TEXT NOT NULL,
object TEXT NOT NULL,
valid_from TEXT,
valid_to TEXT,
confidence REAL DEFAULT 1.0,
source_closet TEXT,
source_file TEXT,
extracted_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (subject) REFERENCES entities(id),
FOREIGN KEY (object) REFERENCES entities(id)
);
(knowledge_graph.py:65-78)
This is the core of the temporal KG. Beyond the standard subject/predicate/object triple fields, five additional fields deserve individual examination:
valid_from and valid_to -- the time window. Both fields allow NULL. A NULL valid_from means "unknown when it started" (not "has existed since the beginning of time"); a NULL valid_to means "currently still valid." This convention is crucial: by checking valid_to IS NULL, the system can immediately distinguish current facts from historical facts.
confidence -- confidence level, defaulting to 1.0 (fully certain). This field leaves room for future probabilistic reasoning. When a fact comes from a less reliable source (such as a relationship inferred from casual conversation), the confidence can be set below 1.0.
source_closet -- an optional provenance field pointing to a closet-like location in the palace model. It represents the intended bridge between the knowledge graph and the palace structure: a triple can record where it came from so later tooling can trace back toward the original memory. In the current public repository, the field exists in the schema and APIs, but it should not be read as evidence that every triple is automatically populated with a complete closet-to-verbatim provenance chain.
source_file -- the original file path. Lower-level provenance information than source_closet.
extracted_at -- the time the triple was entered into the system. Note this differs from valid_from: a fact might have become effective in 2025 but not entered into the system until 2026.
Finally, the index design (knowledge_graph.py:82-84):
CREATE INDEX IF NOT EXISTS idx_triples_subject ON triples(subject);
CREATE INDEX IF NOT EXISTS idx_triples_object ON triples(object);
CREATE INDEX IF NOT EXISTS idx_triples_predicate ON triples(predicate);
CREATE INDEX IF NOT EXISTS idx_triples_valid ON triples(valid_from, valid_to);
Four indexes covering the triple's three dimensions plus the time window. idx_triples_valid is a composite index covering both valid_from and valid_to, enabling efficient execution of time range queries.
erDiagram
ENTITIES {
text id PK
text name
text type
text properties
}
TRIPLES {
text id PK
text subject
text predicate
text object
text valid_from
text valid_to
real confidence
text source_closet
}
ENTITIES ||--o{ TRIPLES : "subject/object"
Writing: add_triple()
The add_triple() method (knowledge_graph.py:110-167) is the knowledge graph's primary write interface. The method signature is:
def add_triple(
self,
subject: str,
predicate: str,
obj: str,
valid_from: str = None,
valid_to: str = None,
confidence: float = 1.0,
source_closet: str = None,
source_file: str = None,
):
Several design details are worth noting.
Automatic entity creation. Before inserting a triple, the method automatically creates entity records for the subject and object (if they do not already exist):
conn.execute("INSERT OR IGNORE INTO entities (id, name) VALUES (?, ?)", (sub_id, subject))
conn.execute("INSERT OR IGNORE INTO entities (id, name) VALUES (?, ?)", (obj_id, obj))
(knowledge_graph.py:134-135)
INSERT OR IGNORE means the operation is skipped if the entity already exists. This frees callers from needing to worry about "has this entity been registered before" -- just add triples, and entities automatically appear in the graph. This further reinforces the "tolerant input" design philosophy.
Deduplication check. Before inserting a new triple, the method checks whether an identical, still-valid triple already exists:
existing = conn.execute(
"SELECT id FROM triples WHERE subject=? AND predicate=? AND object=? AND valid_to IS NULL",
(sub_id, pred, obj_id),
).fetchone()
if existing:
conn.close()
return existing[0] # Already exists and still valid
(knowledge_graph.py:139-146)
Note the valid_to IS NULL in the query condition -- only currently valid triples are checked. If the same relationship once existed but has been marked as ended (valid_to is not null), re-adding the same relationship creates a new record rather than reviving the old one. This is intuitive: if Kai once worked on the Orion project, then left, and has now returned, it should be two separate work stints, not one continuous one.
Triple ID generation. Each triple's ID is a composite string: t_{subject}_{predicate}_{object}_{hash}, where the hash is based on the first 8 characters of the MD5 of valid_from and the current timestamp (knowledge_graph.py:148). This ensures that even when multiple same-type relationships exist between the same entity pair (covering different time periods), each record has a unique ID.
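A sketch of this ID scheme as a standalone function (the name triple_id is hypothetical; the real code builds the ID inline inside add_triple()):

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional

def triple_id(subject: str, predicate: str, obj: str,
              valid_from: Optional[str] = None) -> str:
    """Sketch of the composite-ID scheme: t_{subject}_{predicate}_{object}_{hash},
    where the hash is the first 8 hex chars of the MD5 of valid_from plus
    the current timestamp -- so repeated relationships over different
    time periods still get distinct IDs."""
    now = datetime.now(timezone.utc).isoformat()
    digest = hashlib.md5(f"{valid_from}{now}".encode()).hexdigest()[:8]
    return f"t_{subject}_{predicate}_{obj}_{digest}"

print(triple_id("kai", "works_on", "orion", "2025-06-01"))
```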
Querying: query_entity()
The query_entity() method (knowledge_graph.py:186-241) is the most critical query interface. Its parameter design precisely embodies the temporal KG's query model:
def query_entity(self, name: str, as_of: str = None, direction: str = "outgoing"):
Three parameters, three dimensions:
- name: The entity to query.
- as_of: An optional temporal snapshot. If provided, only facts valid at that point in time are returned.
- direction: Relationship direction. "outgoing" queries relationships where the entity is the subject (entity -> ?), "incoming" queries relationships where the entity is the object (? -> entity), "both" queries both directions.
The SQL implementation of the as_of parameter is the essence of this query logic (knowledge_graph.py:201-203):
if as_of:
query += " AND (t.valid_from IS NULL OR t.valid_from <= ?) AND (t.valid_to IS NULL OR t.valid_to >= ?)"
params.extend([as_of, as_of])
The meaning of this condition is: a fact is valid at the as_of point in time if and only if:
- Its effective date is before or equal to as_of (or the effective date is unknown), and
- Its expiration date is after or equal to as_of (or it has not yet expired).
valid_from IS NULL is treated as "always valid," and valid_to IS NULL is treated as "not yet ended." This means a fact without temporal information is considered valid at all points in time -- a reasonable default behavior, as it avoids "filtering out a fact just because it lacks a time annotation."
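The same window condition can be mirrored in Python for clarity (a sketch; note that ISO-8601 date strings compare correctly as plain strings, which is also what makes the string comparisons in the SQL valid):

```python
from typing import Optional

def valid_at(valid_from: Optional[str], valid_to: Optional[str], as_of: str) -> bool:
    """Python mirror of the SQL window condition: NULL valid_from means
    'always valid', NULL valid_to means 'not yet ended'."""
    starts_ok = valid_from is None or valid_from <= as_of
    ends_ok = valid_to is None or valid_to >= as_of
    return starts_ok and ends_ok

assert valid_at("2025-06-01", "2026-03-01", "2025-12-01")  # mid-window
assert not valid_at("2026-03-15", None, "2025-12-01")      # not yet started
assert valid_at(None, None, "1999-01-01")                  # no annotation: always valid
```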
The query result includes a current field (knowledge_graph.py:215):
"current": row[5] is None,
row[5] is valid_to. If it is None (i.e., NULL), the fact is still valid. This allows callers to distinguish current facts from historical facts at a glance.
A Concrete Query Example
Suppose the knowledge graph contains the following triples:
Kai -> works_on -> Orion (valid_from: 2025-06-01, valid_to: 2026-03-01)
Kai -> works_on -> Nova (valid_from: 2026-03-15, valid_to: NULL)
Kai -> recommended -> Clerk (valid_from: 2026-01-01, valid_to: NULL)
Calling kg.query_entity("Kai") without the as_of parameter returns all three records, with the first having current as False and the latter two as True.
Calling kg.query_entity("Kai", as_of="2025-12-01") returns only the first record (Orion), because in December 2025, Kai had not yet recommended Clerk and had not yet moved to Nova.
Calling kg.query_entity("Kai", as_of="2026-04-01") returns the latter two (Nova and Clerk), because by April 2026, Kai had already left Orion.
This is the power of temporal queries: the same entity presents different factual faces at different points in time.
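These three calls can be reproduced with a minimal in-memory SQLite sketch (simplified schema and a hand-rolled query function, not the repository's API):

```python
import sqlite3

# Simplified temporal-triples table; the real schema has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE triples (
    subject TEXT, predicate TEXT, object TEXT,
    valid_from TEXT, valid_to TEXT)""")
conn.executemany("INSERT INTO triples VALUES (?,?,?,?,?)", [
    ("kai", "works_on", "orion", "2025-06-01", "2026-03-01"),
    ("kai", "works_on", "nova", "2026-03-15", None),
    ("kai", "recommended", "clerk", "2026-01-01", None),
])

def query_entity(subject, as_of=None):
    """Return (object, current) pairs, optionally filtered to an as_of snapshot."""
    sql = "SELECT object, valid_to IS NULL FROM triples WHERE subject=?"
    params = [subject]
    if as_of:
        sql += (" AND (valid_from IS NULL OR valid_from <= ?)"
                " AND (valid_to IS NULL OR valid_to >= ?)")
        params += [as_of, as_of]
    return [(obj, bool(cur)) for obj, cur in conn.execute(sql, params)]

print(query_entity("kai"))                      # all three, with current flags
print(query_entity("kai", as_of="2025-12-01"))  # only orion
print(query_entity("kai", as_of="2026-04-01"))  # nova and clerk
```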
Invalidation: invalidate()
The invalidate() method (knowledge_graph.py:169-182) is used to mark the end of a fact:
def invalidate(self, subject: str, predicate: str, obj: str, ended: str = None):
"""Mark a relationship as no longer valid (set valid_to date)."""
sub_id = self._entity_id(subject)
obj_id = self._entity_id(obj)
pred = predicate.lower().replace(" ", "_")
ended = ended or date.today().isoformat()
conn = self._conn()
conn.execute(
"UPDATE triples SET valid_to=? WHERE subject=? AND predicate=? AND object=? AND valid_to IS NULL",
(ended, sub_id, pred, obj_id),
)
conn.commit()
conn.close()
Design highlights:
- Only updates currently valid records (valid_to IS NULL), so it will not accidentally modify already-ended historical records.
- The default end date is today (ended or date.today().isoformat()); most of the time, you realize something is no longer true "right now."
- No data is deleted. Invalidation is not deletion but setting an end time, so historical queries can still see the record.
This "soft delete" strategy means the knowledge graph is a data structure that only grows, never shrinks. Every fact that was once true remains in the graph permanently. While this might sound like it could create storage pressure, for personal or small-team-scale knowledge graphs, a SQLite database file with even tens of thousands of triples is only a few MB -- not a problem at all.
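The grow-only behavior can be demonstrated in a few lines. This sketch uses a reduced table (no entity-ID indirection; names are illustrative) to show the two properties described above: the guard touches only currently valid rows, and invalidation never removes data:

```python
# Sketch of the "soft delete" behavior: invalidation sets valid_to instead
# of deleting rows, so the row count never shrinks. Illustrative schema.
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT, valid_to TEXT)")
conn.execute("INSERT INTO triples VALUES ('kai', 'works_on', 'orion', NULL)")

def invalidate(subject, predicate, obj, ended=None):
    # Only currently valid rows (valid_to IS NULL) are touched,
    # mirroring the guard in the quoted implementation.
    conn.execute(
        "UPDATE triples SET valid_to=? WHERE subject=? AND predicate=? "
        "AND object=? AND valid_to IS NULL",
        (ended or date.today().isoformat(), subject, predicate, obj),
    )

invalidate("kai", "works_on", "orion", ended="2026-03-01")
invalidate("kai", "works_on", "orion", ended="2026-05-01")  # no-op: already ended

count, valid_to = conn.execute(
    "SELECT COUNT(*), MAX(valid_to) FROM triples").fetchone()
print(count, valid_to)  # 1 '2026-03-01' -- one row survives, first end date wins
```

The second call is a silent no-op: the history already recorded an end date, and it stays intact.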
Why SQLite
MemPalace's temporal knowledge graph directly competes with Zep's Graphiti. The README contains a direct comparison (README.md:359-366):
| Feature | MemPalace | Zep (Graphiti) |
|---|---|---|
| Storage | SQLite (local) | Neo4j (cloud) |
| Cost | Free | $25/mo+ |
| Temporal | Yes | Yes |
| Self-hosted | Always | Enterprise only |
| Privacy | Everything local | SOC 2, HIPAA |
Zep's Graphiti uses Neo4j as its underlying graph database. Neo4j is the benchmark product in the graph database space, supporting native graph traversal, the Cypher query language, and distributed cluster deployment. Its capabilities are beyond question -- but for MemPalace's use case, most of these capabilities are excessive.
MemPalace's knowledge graph query patterns are highly concentrated: entity-centric queries for direct relationships with optional temporal filtering. It does not need multi-hop traversal ("find all people with three degrees of separation from Kai"), does not need complex graph algorithms (shortest path, community detection), and does not need horizontal scaling to multi-node clusters.
For this query pattern, SQLite has three decisive advantages:
Zero operations. SQLite is an embedded database that requires no server startup, connection configuration, or process management. It is just a file -- ~/.mempalace/knowledge_graph.sqlite3 -- and that is it. No Docker, no database administrator, no 3 AM alert wake-ups.
Local-first. Data always stays on your machine. No network connection needed, no authentication needed, no worrying about third-party service privacy policy changes. Your knowledge graph sits in your filesystem alongside your code, notes, and photos.
Good enough. How much knowledge graph data can a person or small team accumulate over several years? A few thousand entities, tens of thousands of triples -- this is already a quite substantial knowledge graph. SQLite handles this scale with query times in the millisecond range. MemPalace's index design (four indexes covering the primary query paths) ensures that even if data volume increases tenfold, performance will not become a bottleneck.
Of course, choosing SQLite also means giving up some things: no native graph traversal algorithms, no visual query interface, no multi-user concurrent write capability. But none of these are hard requirements in the personal AI memory system scenario. This is a classic engineering tradeoff: giving up unnecessary capabilities in exchange for zero operational cost.
query_relationship(): Querying by Relationship Type
In addition to entity-centric queries, MemPalace provides a relationship-type-centric query interface (knowledge_graph.py:243-272):
def query_relationship(self, predicate: str, as_of: str = None):
This method returns all triples with a specific relationship type. For example, kg.query_relationship("works_on") returns all "works on a project" relationships, while kg.query_relationship("works_on", as_of="2026-01-01") returns only work relationships that were still valid on January 1, 2026.
This query pattern is particularly useful in contradiction detection. When the system needs to verify the claim "Soren completed the auth migration," it can call query_relationship("assigned_to") to see who the auth-migration project was actually assigned to. We will discuss this mechanism in detail in Chapter 12.
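The shape of this query can be inferred from the entity query analyzed earlier: same table, same as_of semantics, but filtered by predicate rather than by entity. The SQL below is a reconstruction of that behavior, not a copy of knowledge_graph.py:

```python
# Inferred sketch of query_relationship(): filter by predicate, with the
# same optional as_of time window. Illustrative schema and data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT,"
             " valid_from TEXT, valid_to TEXT)")
conn.executemany("INSERT INTO triples VALUES (?,?,?,?,?)", [
    ("kai",  "works_on",    "orion",          "2025-06-01", "2026-03-01"),
    ("kai",  "works_on",    "nova",           "2026-03-15", None),
    ("maya", "assigned_to", "auth-migration", "2026-01-15", None),
])

def query_relationship(predicate, as_of=None):
    sql = "SELECT subject, object FROM triples WHERE predicate=?"
    args = [predicate]
    if as_of:
        sql += (" AND (valid_from IS NULL OR valid_from <= ?)"
                " AND (valid_to IS NULL OR valid_to >= ?)")
        args += [as_of, as_of]
    return conn.execute(sql, args).fetchall()

print(query_relationship("works_on"))                      # both work relationships
print(query_relationship("works_on", as_of="2026-01-01"))  # only the Orion one
```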
Seeding from Known Facts
The seed_from_entity_facts() method (knowledge_graph.py:338-384) demonstrates how the knowledge graph is initialized. It accepts a structured entity-facts dictionary and batch-creates entities and triples:
def seed_from_entity_facts(self, entity_facts: dict):
"""
Seed the knowledge graph from fact_checker.py ENTITY_FACTS.
This bootstraps the graph with known ground truth.
"""
This method handles multiple relationship types: child_of (parent-child), married_to (marriage), is_sibling_of (sibling), is_pet_of (pet ownership), and loves (interests/hobbies). Each relationship carries an appropriate valid_from timestamp -- parent-child relationships start from the birth date, interests/hobbies start from 2025-01-01.
The comments note that data comes from fact_checker.py ENTITY_FACTS, meaning there is an independent fact verification module maintaining a set of verified baseline facts. The knowledge graph seeding process is essentially converting these baseline facts from one data structure (Python dictionary) to another (SQLite triples). This design decouples "fact source" from "fact storage" -- you can replace the knowledge graph implementation without affecting fact verification logic, and vice versa.
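The conversion itself is mechanical. The sketch below uses a hypothetical dict shape -- the real format of ENTITY_FACTS is not shown in the repository excerpts quoted here -- to illustrate the idea of turning baseline facts into dated triples:

```python
# Illustrative sketch of knowledge-graph seeding: ground-truth facts in,
# dated triples out. The dict shape and names are assumptions.
ENTITY_FACTS = {
    "mira": {
        "child_of": [("elena", "2015-06-10")],  # (parent, child's birth date)
        "loves": ["chess"],
    },
}

def seed(entity_facts):
    triples = []  # (subject, predicate, object, valid_from)
    for entity, facts in entity_facts.items():
        for parent, born in facts.get("child_of", []):
            # Parent-child relationships are valid from the birth date.
            triples.append((entity, "child_of", parent, born))
        for hobby in facts.get("loves", []):
            # Interests default to the 2025-01-01 epoch mentioned in the text.
            triples.append((entity, "loves", hobby, "2025-01-01"))
    return triples

print(seed(ENTITY_FACTS))
```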
stats(): Graph Overview
The stats() method (knowledge_graph.py:315-334) provides global statistics for the knowledge graph:
return {
"entities": entities,
"triples": triples,
"current_facts": current,
"expired_facts": expired,
"relationship_types": predicates,
}
The distinction between current_facts and expired_facts is particularly meaningful. If a knowledge graph has far more expired_facts than current_facts, it indicates the graph covers a long time span with extensive accumulated history. If current_facts far exceeds expired_facts, most facts were recently entered and are still valid. This ratio itself is metadata, telling you the knowledge graph's "age" and "activity level."
Entity ID Normalization
A seemingly minor but very important design detail is how entity IDs are generated (knowledge_graph.py:92-93):
def _entity_id(self, name: str) -> str:
return name.lower().replace(" ", "_").replace("'", "")
All entity names are normalized before storage: converted to lowercase, spaces replaced with underscores, apostrophes removed. This means "Kai", "kai", and "KAI" all map to the same entity ID "kai".
This design solves a very practical problem: when extracting facts from multiple sources (conversations, documents, code comments), the same entity will almost certainly appear in different cases and formats. Without normalization, the knowledge graph would contain kai, Kai, and KAI as three separate entities with no relationships between them -- but in reality they are the same person.
The normalization function is intentionally simple. It does not attempt to handle complex synonym problems (such as recognizing that the same name written in two different scripts or spellings refers to the same person), nor does it attempt entity disambiguation (same name, different people). It only handles the most common variant cases. More complex entity resolution is left to upstream entity detection modules.
Design Philosophy Summary
Looking back at the entire knowledge_graph.py design, several principles run throughout:
Time is a first-class citizen. Every write operation accepts time parameters; every query operation supports time filtering. Time is not an afterthought annotation but a core dimension of the data model.
Tolerant input, precise output. On write, missing time information, missing entity types, and missing provenance information are all acceptable. On query, filtering by time window is precise, and the distinction between current and historical facts is precise. The system will not refuse to work because data is imperfect, but it will not give vague answers because data is imperfect either.
Grow only, never shrink. invalidate() does not delete data; it only marks the end time. add_triple() does not overwrite ended records; it creates new records. The knowledge graph is a chronicle, and every page is preserved.
Local-first, zero dependencies. SQLite as the storage engine requires no external services, no network connection, and no additional process management. The entire knowledge graph is a single .sqlite3 file in your filesystem.
These principles together form the design philosophy of MemPalace's temporal knowledge graph. It does not pursue the full capabilities of a graph database, but rather achieves the most critical temporal functionality with minimal complexity for the specific scenario of personal AI memory systems.
The next chapter will examine an important application of the temporal KG: contradiction detection. When a new claim conflicts with existing facts in the knowledge graph, how does the system discover and report this inconsistency? The answer lies in the cross-comparison of time windows.
Chapter 12: Contradiction Detection
Positioning: The middle chapter of Part 4, "The Time Dimension." This chapter analyzes how MemPalace uses the temporal knowledge graph to detect attribution conflicts, stale information, and inconsistent dates -- starting from three specific examples in the README, inferring implementation mechanisms, and discussing the engineering tradeoffs between false positives and false negatives.
AI Confidently Gets Things Wrong
Large language models have a well-known characteristic: they make mistakes without hesitation. They do not say "I'm not sure" or "let me check" -- they deliver incorrect answers with exactly the same tone and fluency as correct ones.
When AI serves as your memory system, this characteristic becomes particularly dangerous. If your AI assistant remembers "Soren is responsible for the auth migration" when Maya is actually the responsible party, then every subsequent decision based on this incorrect information is built on sand. Worse yet, you may never discover the error -- because you trust your memory system, just as you trust your own memory.
MemPalace's contradiction detection mechanism is designed precisely to address this problem. It does not try to prevent the AI from making mistakes (currently impossible), but instead sounds the alarm when the AI is about to make one.
Three Concrete Contradiction Scenarios
MemPalace's README demonstrates three different types of contradiction detection (README.md:262-273):
Input: "Soren finished the auth migration"
Output: AUTH-MIGRATION: attribution conflict -- Maya was assigned, not Soren
Input: "Kai has been here 2 years"
Output: KAI: wrong_tenure -- records show 3 years (started 2023-04)
Input: "The sprint ends Friday"
Output: SPRINT: stale_date -- current sprint ends Thursday (updated 2 days ago)
These three examples may seem simple, but they actually represent three completely different detection logics. Let us analyze each one.
Scenario 1: Attribution Conflict
Input: "Soren finished the auth migration"
Output: AUTH-MIGRATION: attribution conflict -- Maya was assigned, not Soren
This statement contains an implicit attribution assertion: the auth migration is Soren's work. To detect this contradiction, the system needs to:
- Identify the entities involved in the statement: Soren (person) and auth-migration (project/task).
- Identify the relationship type in the statement: some kind of "completed" or "responsible for" relationship.
- Query the knowledge graph: who was auth-migration actually assigned to?
From the knowledge_graph.py analyzed in Chapter 11, this query corresponds to querying auth-migration as an entity in the incoming direction:
kg.query_entity("auth-migration", direction="incoming")
Or directly querying a specific relationship type:
kg.query_relationship("assigned_to")
The triple existing in the knowledge graph might be:
Maya -> assigned_to -> auth-migration (valid_from: 2026-01-15, valid_to: NULL)
When the new statement attempts to establish the relationship Soren -> completed -> auth-migration, the system discovers that the assigned_to relationship for auth-migration points to Maya, not Soren. Two different people being associated with the same task's responsibility -- this is an attribution conflict.
The key insight is: this detection does not require understanding the full semantics of natural language. It only needs to do three things -- extract entities, identify relationship types, and cross-reference against known facts. The knowledge graph provides the baseline for comparison, and temporal information ensures the comparison uses currently valid facts.
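The three-step check can be sketched in a few lines. The function name, report format, and fact representation below are illustrative, not MemPalace's actual API:

```python
# Sketch of the attribution check: compare the claimed actor against the
# currently valid assigned_to facts. Names and format are illustrative.
def check_attribution(claimed_actor, task, current_facts):
    """current_facts: (subject, predicate, object) triples valid right now."""
    assignees = [s for s, p, o in current_facts
                 if p == "assigned_to" and o == task]
    if assignees and claimed_actor not in assignees:
        return (f"{task.upper()}: attribution conflict -- "
                f"{assignees[0]} was assigned, not {claimed_actor}")
    return None  # no known assignee, or the claim matches

facts = [("maya", "assigned_to", "auth-migration")]
print(check_attribution("soren", "auth-migration", facts))
# AUTH-MIGRATION: attribution conflict -- maya was assigned, not soren
print(check_attribution("maya", "auth-migration", facts))  # None
```

Note the false-negative case built into the last line of the function: if the knowledge graph holds no assignee at all, no alert fires -- a point the chapter returns to below.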
Scenario 2: Tenure Error
Input: "Kai has been here 2 years"
Output: KAI: wrong_tenure -- records show 3 years (started 2023-04)
This contradiction involves dynamic calculation. The "2 years" in the statement is not a static value that can be directly stored in the knowledge graph -- it needs to be calculated from Kai's start date and the current date.
The triple stored in the knowledge graph might be:
Kai -> started_at -> Company (valid_from: 2023-04, valid_to: NULL)
The detection logic works roughly as follows:
- Extract the entity Kai and the numerical claim "2 years" from the statement.
- Query Kai's started_at or similar employment relationship.
- Calculate the actual tenure from valid_from (2023-04) to the current date.
- If the calculated result (~3 years) does not match the stated value (2 years), trigger an alert.
The core capability of this detection comes from the temporal KG's valid_from field. If the knowledge graph only stored "Kai works at the company" as a static fact, it could not determine whether a tenure claim is correct. It is precisely because it stores "Kai started working at the company in April 2023" that the system has the foundation data for calculating tenure.
Note the (started 2023-04) in the output -- the system not only identifies the contradiction but also provides the basis for its judgment. This allows the user to decide: is the date in the knowledge graph wrong, or is the number in the statement incorrect. Contradiction detection does not make the final ruling; it merely presents the inconsistency to the human.
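The dynamic calculation is straightforward. This sketch assumes a whole-year rounding policy and an illustrative report format; neither is confirmed by the repository excerpts:

```python
# Sketch of the tenure check: derive elapsed years from valid_from and
# compare against the stated number. Rounding policy is an assumption.
from datetime import date

def check_tenure(stated_years, valid_from, today):
    start = date.fromisoformat(valid_from)
    actual = (today - start).days // 365  # rough whole-year tenure
    if actual != stated_years:
        # Report includes the evidence, as in the README output.
        return f"wrong_tenure -- records show {actual} years (started {valid_from[:7]})"
    return None

print(check_tenure(2, "2023-04-01", date(2026, 5, 1)))
# wrong_tenure -- records show 3 years (started 2023-04)
```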
Scenario 3: Stale Date
Input: "The sprint ends Friday"
Output: SPRINT: stale_date -- current sprint ends Thursday (updated 2 days ago)
This scenario detects a more subtle type of contradiction: the statement may have been correct a few days ago but has since become stale.
The knowledge graph might contain two triples about the sprint end date:
Sprint -> ends_on -> Friday (valid_from: 2026-03-20, valid_to: 2026-03-23)
Sprint -> ends_on -> Thursday (valid_from: 2026-03-23, valid_to: NULL)
The first triple has been marked as ended by invalidate() (because the sprint end date was updated two days ago), and the second triple is currently valid.
When the statement references "Friday," the system queries the currently valid sprint end date using the as_of parameter and discovers the current record shows Thursday, not Friday. The (updated 2 days ago) additional information comes from the first triple's valid_to date -- it tells you when the information became stale.
This is the value of the invalidate() method (knowledge_graph.py:169-182) in contradiction detection. It is not deleting incorrect information but recording the lifecycle of information. Old facts become history, new facts take their place, and the system can precisely tell you when this transition occurred.
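Putting the two triples together, the check reduces to "the currently valid fact wins, and the superseded record's valid_to dates the staleness." The record shapes and report format below are illustrative:

```python
# Sketch of the stale-date check against a pair of (value, valid_from,
# valid_to) records for Sprint -> ends_on. Illustrative shapes.
from datetime import date

records = [
    ("Friday",   "2026-03-20", "2026-03-23"),  # superseded two days before 'today'
    ("Thursday", "2026-03-23", None),          # currently valid
]

def check_sprint_end(stated, records, today):
    current = next(v for v, _, end in records if end is None)
    if stated != current:
        superseded = next((end for v, _, end in records if v == stated and end), None)
        days_ago = (today - date.fromisoformat(superseded)).days if superseded else None
        return f"SPRINT: stale_date -- current sprint ends {current}" + (
            f" (updated {days_ago} days ago)" if days_ago is not None else "")
    return None

print(check_sprint_end("Friday", records, date(2026, 3, 25)))
# SPRINT: stale_date -- current sprint ends Thursday (updated 2 days ago)
```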
Implementation Mechanism Inference
From the three scenarios, a generalized contradiction detection process can be extracted:
Input statement
|
v
Entity extraction -- identify people, projects, times, and values from the statement
|
v
Relationship mapping -- infer the relationship type implied in the statement
|
v
Knowledge graph query -- use query_entity() or query_relationship() to retrieve known facts
|
v
Cross-comparison -- compare assertions in the statement against known facts
|
v
Contradiction report -- if inconsistency found, generate a report with contradiction type and evidence
The first step (entity extraction) and second step (relationship mapping) are natural language processing tasks. From MemPalace's overall architecture, this work is most likely performed by an LLM -- either the LLM the user is conversing with (via MCP tool calls) or a lightweight language processing module integrated within MemPalace.
The third step (knowledge graph query) directly uses the KnowledgeGraph class's query methods. From Chapter 11's analysis, query_entity() supports as_of time filtering and direction control, and query_relationship() supports query by relationship type -- these two interfaces are sufficient to cover the query needs of all three contradiction types above.
The fourth step (cross-comparison) is the core judgment logic. It needs to execute different comparison strategies based on the contradiction type:
- Attribution conflict: check whether the same task/project has been assigned to different people. The comparison condition is the existence of multiple assigned_to-type relationships for the same object entity with different subjects.
- Numerical inconsistency: dynamically calculate values (tenure, age, etc.) from timestamps and compare against the stated values.
- Stale date: query currently valid date-type facts and compare against the dates referenced in the statement.
The Role of Confidence
Every triple in the knowledge graph has a confidence field (knowledge_graph.py:72), defaulting to 1.0. This field plays an important role in contradiction detection.
When two facts contradict each other, confidence provides a priority judgment: if the knowledge graph's fact has a confidence of 1.0 (fully certain) and the new statement comes from casual conversation (likely lower confidence), the system is inclined to trust the existing fact. Conversely, if the existing fact already has low confidence, the contradiction may indicate that the new statement provides more accurate information.
Confidence is not the decision criterion for contradiction detection -- the system still reports the contradiction -- but it provides context for the contradiction report. "The knowledge graph has a record with confidence 0.6 that contradicts your statement" and "The knowledge graph has a baseline fact with confidence 1.0 that contradicts your statement" represent different levels of severity.
The Provenance Value of source_closet
When a contradiction is detected, the source_closet field (knowledge_graph.py:74) offers a provenance slot. In the design, that means the system can record not only "Maya was assigned to auth migration" but also where that information came from in the palace model.
The important implementation boundary is that source_closet is currently best understood as an optional tracing field exposed by the schema and APIs, not as proof that the public repository already auto-populates a complete provenance chain for every contradiction. The knowledge graph can represent the connection; the fully automated extraction-and-linking pipeline is still a stronger claim than the current code supports.
Contradiction Classification
From the README's three examples, a contradiction classification system can be extracted. Note the output uses different severity level markers -- attribution conflicts are marked as red (high severity), while tenure errors and stale dates are marked as yellow (moderate severity).
The logic behind this grading:
High severity (attribution conflict): A person is incorrectly attributed as being responsible for something. The consequences of this error can be quite serious -- you might thank the wrong person in a meeting or assign follow-up tasks to the wrong person.
Moderate severity (numerical inconsistency): Tenure, age, or other numerical claims do not match records. This type of error is usually an inadvertent approximation (remembering "2 years" when it is actually "3 years"), with relatively limited consequences, but still worth correcting.
Moderate severity (stale date): A reference to time information that has since been updated. This type of error usually occurs because the speaker did not know the information had changed, not because they misremembered.
A more complete contradiction classification might include:
| Contradiction Type | Severity | Detection Method |
|---|---|---|
| Attribution conflict | High | Cross-referencing different attributees for the same task |
| Numerical inconsistency | Moderate | Dynamic calculation from timestamps, then comparison |
| Stale date | Moderate | Comparing currently valid facts against the statement |
| Status contradiction | High | A concluded fact referenced as ongoing |
| Relationship contradiction | Moderate | Incompatible relationship types existing simultaneously |
False Positives and False Negatives
Any detection system faces the tradeoff between false positives and false negatives. Contradiction detection is no exception.
False Positive Scenarios
Same name, different entities. Suppose the team has two people named "Jordan": one a designer, one a backend engineer. When a statement mentions "Jordan completed the UI design," the system might incorrectly trigger an attribution conflict because the other Jordan is a backend engineer.
From the _entity_id() implementation (knowledge_graph.py:92-93), entity IDs are generated through simple string normalization -- "jordan" is just "jordan", with no disambiguation mechanism. This means same-named entities are merged into a single node, potentially causing false contradictions.
Solutions might include using full names or qualifiers during entity registration ("Jordan Chen" vs "Jordan Kim"), but this requires upstream entity extraction to be sufficiently precise.
Semantic interpretation errors. "Soren helped with the auth migration" and "Soren finished the auth migration" express different relationships -- "helping" does not equal "being responsible." If the system interprets "helping" as an attribution relationship, it produces a false positive.
This type of false positive depends on the precision of the relationship mapping stage. If relationship mapping is too broad (treating all person-task associations as attribution relationships), false positive rates rise; if too strict (only counting explicit "responsible for" and "completed" as attribution), false negative rates rise.
Time granularity mismatch. The knowledge graph's valid_from uses date strings ("2026-01-15"), while statements may use vaguer temporal expressions ("last month," "end of last year"). If the conversion between these is not accurate enough, it can cause false positives.
False Negative Scenarios
Incomplete knowledge graph. If a fact was never entered into the knowledge graph, the system cannot detect contradictions related to it. For example, if the knowledge graph has no record of Maya being assigned to auth migration, then "Soren completed the auth migration" will not trigger any alert -- because the system does not know who was supposed to be responsible.
This is the most fundamental source of false negatives. Contradiction detection can only work within the scope of known facts. The knowledge graph's coverage directly determines contradiction detection's recall rate.
Implicit contradictions. Some contradictions are not direct factual conflicts but can only be discovered through logical inference. For example, "Kai was on vacation all last week" and "Kai reviewed 12 PRs last week" -- these two statements share no entity relationships on the surface, but they are logically contradictory (reviewing 12 PRs while on vacation is unlikely). This type of inferential contradiction exceeds the capability of simple triple comparison.
Gradual contradictions. Some facts do not suddenly become incorrect but gradually diverge from reality. For example, "our team has 5 people" was correct three months ago, but over those three months people joined and left, and the actual count is now 7. If no one explicitly calls invalidate() on the old team size information and enters new data, the knowledge graph continues to believe "5 people" is valid.
Engineering Strategy
Facing the tradeoff between false positives and false negatives, MemPalace's design choice leans toward preferring false positives over false negatives. The reasoning is straightforward: a false positive only costs the user a few seconds of confirmation ("oh, this one is fine"); a false negative might allow an incorrect fact to survive in the system for months, affecting all subsequent answers and decisions.
From the output format in the README, contradiction reports come with complete judgment evidence ("Maya was assigned, not Soren"; "records show 3 years (started 2023-04)"; "current sprint ends Thursday (updated 2 days ago)"). This allows users to quickly assess whether a contradiction report is a genuine contradiction or a false positive, thereby reducing the negative impact of false positives on user experience.
The Closed Loop of Contradiction Detection
Contradiction detection is not an endpoint but the starting point of a closed loop. When a contradiction is detected, there are three possible handling paths:
Path 1: Correct the statement. The user acknowledges the statement was wrong. "Oh right, it is indeed Maya doing the auth migration, not Soren." The knowledge graph requires no changes.
Path 2: Update the knowledge graph. The user confirms the statement is correct and the knowledge graph needs updating. For example, the auth migration responsibility has indeed changed from Maya to Soren. This requires calling invalidate() to end Maya's assigned_to relationship, then using add_triple() to create Soren's new relationship.
Path 3: Flag for investigation. The user is unsure which version is correct. In this case, the contradiction itself is valuable information -- it marks an area of uncertainty in the knowledge graph, reminding the user to verify next time the related topic comes up.
All three paths are better than "letting incorrect information slip through quietly." The core value of contradiction detection is not whether its accuracy rate is 90% or 99%, but that it makes the fact "AI might be making a mistake" go from implicit to explicit.
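Path 2 is the only one that mutates the graph, and it does so as two append-style operations: end the old fact, add the new one, never overwrite. The triple store below is a plain list of dicts standing in for the SQLite table, and the function names mirror (but do not reproduce) the API discussed in Chapter 11:

```python
# Sketch of Path 2: a handoff recorded as "end the old fact, add a new one."
# Illustrative in-memory store; not the actual KnowledgeGraph class.
def invalidate(store, subject, predicate, obj, ended):
    for t in store:
        if (t["s"], t["p"], t["o"]) == (subject, predicate, obj) and t["valid_to"] is None:
            t["valid_to"] = ended  # soft delete: set an end date

def add_triple(store, subject, predicate, obj, valid_from):
    store.append({"s": subject, "p": predicate, "o": obj,
                  "valid_from": valid_from, "valid_to": None})

store = [{"s": "maya", "p": "assigned_to", "o": "auth-migration",
          "valid_from": "2026-01-15", "valid_to": None}]
invalidate(store, "maya", "assigned_to", "auth-migration", ended="2026-04-02")
add_triple(store, "soren", "assigned_to", "auth-migration", valid_from="2026-04-02")

print([t["s"] for t in store if t["valid_to"] is None])  # ['soren'] is current
print(len(store))  # 2 -- Maya's tenure on the task is preserved as history
```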
Deeper Design Considerations
Coupling Between Contradiction Detection and the Temporal KG
Of the three contradiction types, two (numerical inconsistency, stale date) directly depend on the temporal KG's capabilities. If the knowledge graph had no valid_from and valid_to fields, tenure could not be dynamically calculated from start dates, and stale dates could not be distinguished from currently valid dates.
Attribution conflict detection could theoretically be implemented on a static KG (as long as attribution relationships are stored), but in practice, the temporal KG makes detection more precise -- it can distinguish "Maya is currently responsible for auth migration" from "Maya was once responsible for auth migration (but has since handed it off)."
This means contradiction detection is not an independent functional module but a natural extension of the temporal knowledge graph. With the time dimension, contradiction detection is almost "free" -- you just need to perform time-aware comparison between new statements and existing facts.
Scale Boundaries of Contradiction Detection
The current implementation assumes the knowledge graph's scale is manageable -- a few hundred entities, a few thousand triples. At this scale, full scans of query_relationship() results are entirely feasible.
But if the knowledge graph grows to millions of triples (e.g., a large organization's complete knowledge base), the per-record comparison strategy would need to evolve. Possible directions include: building dedicated indexes for contradiction detection (such as an entity-pair indexed attribution relationship table), introducing incremental detection (only checking new triples for contradictions rather than full comparisons each time), or using a rule engine to define contradiction patterns rather than hard-coding detection logic.
However, for MemPalace's target scenario -- personal or small team AI memory systems -- the current implementation is sufficient. The knowledge graph's growth rate is bounded by the user's conversation volume and fact extraction rate, making it unlikely to reach the scale requiring detection strategy optimization within a few years.
This is another engineering tradeoff: design for the current scale rather than over-engineering for an imagined future scale. SQLite handles queries on a few thousand triples in milliseconds, and contradiction detection's additional overhead is negligible. When the day comes that optimization is truly needed, it can be addressed then.
Summary
Contradiction detection is one of the most practically valuable applications of MemPalace's temporal knowledge graph. It elevates the AI memory system from "remember everything" to "remember, and tell you when it remembers incorrectly."
Three contradiction types -- attribution conflict, numerical inconsistency, stale date -- each represent different detection logic, but they share the same infrastructure: a knowledge graph with time windows. valid_from and valid_to not only make historical queries possible but also make staleness detection and dynamic calculation possible.
The next chapter will examine another application of the temporal knowledge graph: timeline narration. When you need to understand the complete history of a project or a person, the timeline() method weaves discrete triples into a readable chronicle.
Chapter 13: Timeline Narration
Positioning: The closing chapter of Part 4, "The Time Dimension." Starting from the timeline() method's implementation, this chapter demonstrates how discrete temporal triples are transformed into a readable chronicle, and explores its practical value in scenarios such as new-hire onboarding.
From Triples to Story
Knowledge graphs excel at answering structured queries: "What project is Kai working on now?" "Who is responsible for the auth migration?" "When did this fact expire?" But when you need to understand the complete history of an entity -- for example, when a newly joined engineer wants to quickly learn the full story behind a project -- individual triple queries fall short. What you need is not individual isolated facts but a chronologically arranged narrative.
This is the design purpose of the timeline() method: sorting discrete triples by time to form a readable chronicle.
The Implementation of timeline()
The timeline() method is located at knowledge_graph.py:274-311. Its interface is remarkably concise:
def timeline(self, entity_name: str = None):
"""Get all facts in chronological order, optionally filtered by entity."""
One parameter, one choice: you can view a specific entity's timeline or the entire knowledge graph's timeline.
Entity Timeline
When entity_name is provided (knowledge_graph.py:277-289):
if entity_name:
eid = self._entity_id(entity_name)
rows = conn.execute(
"""
SELECT t.*, s.name as sub_name, o.name as obj_name
FROM triples t
JOIN entities s ON t.subject = s.id
JOIN entities o ON t.object = o.id
WHERE (t.subject = ? OR t.object = ?)
ORDER BY t.valid_from ASC NULLS LAST
""",
(eid, eid),
).fetchall()
Three design points are worth noting.
Bidirectional matching. WHERE (t.subject = ? OR t.object = ?) -- the entity may appear on either side of a triple. When querying Kai's timeline, it includes both Kai -> works_on -> Orion where Kai is the subject and Priya -> manages -> Kai where Kai is the object. This guarantees the timeline is complete -- a person's story includes not just what they did but also what happened to them.
Chronological sorting. ORDER BY t.valid_from ASC -- sorted by effective date in ascending order, placing the earliest facts first. This is the natural order of a chronicle: moving from past to present.
NULLs sort last. NULLS LAST -- facts without an explicit effective date are placed at the end of the timeline. These are "don't know when it started" facts. Placing them last rather than first is a reasonable choice: in a chronicle, events with exact dates are more referentially valuable than facts without dates and should be displayed first.
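The ordering behavior can be demonstrated in isolation with SQLite's standard library binding (a minimal sketch, assuming SQLite >= 3.30, which introduced the NULLS FIRST/LAST syntax):

```python
import sqlite3

# Minimal demo of ORDER BY ... ASC NULLS LAST, mirroring how
# timeline() pushes undated facts to the end of the chronicle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (label TEXT, valid_from TEXT)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("undated", None), ("early", "2024-09-01"), ("late", "2026-03-01")],
)
rows = conn.execute(
    "SELECT label FROM facts ORDER BY valid_from ASC NULLS LAST"
).fetchall()
# Dated facts come first in chronological order; the undated fact sorts last.
```

Without NULLS LAST, SQLite's default would sort NULL values first, which would open every chronicle with its least-anchored facts.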
Global Timeline
When entity_name is not provided (knowledge_graph.py:291-298):
else:
rows = conn.execute("""
SELECT t.*, s.name as sub_name, o.name as obj_name
FROM triples t
JOIN entities s ON t.subject = s.id
JOIN entities o ON t.object = o.id
ORDER BY t.valid_from ASC NULLS LAST
LIMIT 100
""").fetchall()
The global timeline queries all triples, also sorted by time, but adds a LIMIT 100 restriction. This is a pragmatic safety valve: if the knowledge graph contains tens of thousands of triples, returning them all at once wastes memory and overwhelms the caller. 100 records is a reasonable default -- enough to show the knowledge graph's overview without causing overload.
Return Structure
Both queries share the same return format (knowledge_graph.py:300-311):
return [
{
"subject": r[10],
"predicate": r[2],
"object": r[11],
"valid_from": r[4],
"valid_to": r[5],
"current": r[5] is None,
}
for r in rows
]
Each record is a dictionary with six fields. subject, predicate, and object constitute the fact itself; valid_from and valid_to mark the time window; current indicates whether it is still valid.
Note that subject and object use r[10] and r[11] -- these are sub_name and obj_name from the SQL JOIN result, meaning the entity's original display name rather than the normalized ID. This is crucial for timeline narration: users should see "Kai" not "kai," "auth-migration" not an internal ID.
A Complete Example
Suppose we have built the following knowledge graph for the Driftwood project:
kg = KnowledgeGraph()
# Project creation
kg.add_triple("Priya", "created", "Driftwood", valid_from="2024-09-01")
kg.add_triple("Priya", "manages", "Driftwood", valid_from="2024-09-01")
# Team assembly
kg.add_triple("Kai", "joined", "Driftwood", valid_from="2024-10-01")
kg.add_triple("Soren", "joined", "Driftwood", valid_from="2024-10-15")
kg.add_triple("Maya", "joined", "Driftwood", valid_from="2024-11-01")
# Technical decisions
kg.add_triple("Driftwood", "uses", "PostgreSQL", valid_from="2024-10-10")
kg.add_triple("Kai", "recommended", "Clerk", valid_from="2026-01-01")
kg.add_triple("Driftwood", "uses", "Clerk", valid_from="2026-01-15")
# Task assignments
kg.add_triple("Maya", "assigned_to", "auth-migration", valid_from="2026-01-15")
kg.add_triple("Maya", "completed", "auth-migration", valid_from="2026-02-01")
# Personnel changes
kg.add_triple("Leo", "joined", "Driftwood", valid_from="2026-03-01")
Calling kg.timeline() with no entity filter returns the full chronicle (an entity timeline such as kg.timeline("Driftwood") would omit facts like Kai -> recommended -> Clerk, where Driftwood is neither subject nor object):
[
{"subject": "Priya", "predicate": "created", "object": "Driftwood", "valid_from": "2024-09-01", "valid_to": None, "current": True},
{"subject": "Priya", "predicate": "manages", "object": "Driftwood", "valid_from": "2024-09-01", "valid_to": None, "current": True},
{"subject": "Kai", "predicate": "joined", "object": "Driftwood", "valid_from": "2024-10-01", "valid_to": None, "current": True},
{"subject": "Driftwood", "predicate": "uses", "object": "PostgreSQL", "valid_from": "2024-10-10", "valid_to": None, "current": True},
{"subject": "Soren", "predicate": "joined", "object": "Driftwood", "valid_from": "2024-10-15", "valid_to": None, "current": True},
{"subject": "Maya", "predicate": "joined", "object": "Driftwood", "valid_from": "2024-11-01", "valid_to": None, "current": True},
{"subject": "Kai", "predicate": "recommended", "object": "Clerk", "valid_from": "2026-01-01", "valid_to": None, "current": True},
{"subject": "Driftwood", "predicate": "uses", "object": "Clerk", "valid_from": "2026-01-15", "valid_to": None, "current": True},
{"subject": "Maya", "predicate": "assigned_to", "object": "auth-migration", "valid_from": "2026-01-15", "valid_to": None, "current": True},
{"subject": "Maya", "predicate": "completed", "object": "auth-migration", "valid_from": "2026-02-01", "valid_to": None, "current": True},
{"subject": "Leo", "predicate": "joined", "object": "Driftwood", "valid_from": "2026-03-01", "valid_to": None, "current": True},
]
From this result set, a human reader or an LLM can reconstruct the complete story of the Driftwood project:
In September 2024, Priya created the Driftwood project and assumed the manager role. In October, Kai joined the team and the team chose PostgreSQL as its database. Soren joined in mid-October, followed by Maya in November.
Moving into 2026, Kai recommended Clerk as the authentication solution in January, and the team formally adopted it on January 15. Maya was simultaneously assigned the auth migration task and completed it on February 1. In March, new member Leo joined the team.
This is the "from triples to story" process. The raw data is a set of discrete, structured factual records; after chronological sorting, they naturally arrange into a narrative arc, with causal relationships emerging -- Kai recommended Clerk before the team adopted Clerk; Maya was assigned the task before completing it.
Sorting, Aggregation, and Formatting
The timeline() method itself only completes the first step: sorting. It arranges triples in ascending valid_from order and returns a chronologically ordered list. But between the raw list and a readable narrative, there are two additional steps typically handled by the caller.
Aggregation
In the raw timeline, each record is an independent triple. But in a narrative, some triples should be aggregated for display. For example, both Kai and Soren joined Driftwood in October 2024; in the narrative, these can be combined as "In October, Kai and Soren joined the team in succession."
The aggregation strategy can be quite simple: group by month (or week, or day) of valid_from, merging similar events within the same time period. The specific grouping granularity depends on the timeline's span -- for project histories spanning several years, monthly grouping makes sense; for spans of only a few weeks, daily grouping is clearer.
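As a caller-side sketch of that strategy (assuming the dict format timeline() returns, with ISO date strings in valid_from), monthly grouping can be a few lines:

```python
from collections import defaultdict

def group_by_month(timeline):
    # Group timeline() records by the YYYY-MM prefix of valid_from;
    # undated facts (valid_from is None) fall into an "undated" bucket.
    groups = defaultdict(list)
    for fact in timeline:
        key = fact["valid_from"][:7] if fact["valid_from"] else "undated"
        groups[key].append(fact)
    return dict(groups)

facts = [
    {"subject": "Kai", "predicate": "joined", "object": "Driftwood",
     "valid_from": "2024-10-01"},
    {"subject": "Soren", "predicate": "joined", "object": "Driftwood",
     "valid_from": "2024-10-15"},
    {"subject": "Maya", "predicate": "joined", "object": "Driftwood",
     "valid_from": "2024-11-01"},
]
grouped = group_by_month(facts)
# Both October joins land in "2024-10"; Maya's lands in "2024-11".
```

Switching the slice from [:7] to [:10] ([:4]) gives daily (yearly) granularity, so the same helper covers the span-dependent choices described above.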
Formatting
Timeline data can be formatted into multiple forms:
Plain text chronicle -- like the reconstructed narrative above, suitable for directly presenting to users in conversation.
Structured timeline -- a bulleted list grouped by time period, suitable for quick scanning:
2024-09 Priya created the Driftwood project
2024-10 Kai joined | Chose PostgreSQL | Soren joined
2024-11 Maya joined
2026-01 Kai recommended Clerk | Team adopted Clerk | Maya began auth migration
2026-02 Maya completed auth migration
2026-03 Leo joined
AAAK compressed format -- using MemPalace's AAAK dialect for further compression, suitable as part of AI context:
TL:DRIFTWOOD|PRI.create(24-09)|KAI.join(24-10)|PG.adopt(24-10)|SOR.join(24-10)|MAY.join(24-11)|CLK.rec:KAI(26-01)|CLK.adopt(26-01)|MAY.auth-mig(26-01>26-02)|LEO.join(26-03)
The timeline() method returns structured data rather than formatted text, giving callers maximum flexibility. The MCP server can format it as a natural language reply in conversation, the CLI can format it as a terminal table output, and the AAAK compressor can convert it into an ultra-compact timeline summary.
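Such a caller-side formatter can be sketched directly on top of the returned dicts (illustrative only; this is not a KnowledgeGraph method, and the structured-timeline form it produces is the bulleted style shown above):

```python
from collections import defaultdict

def format_timeline(records):
    # Render timeline() records as a month-grouped bulleted list:
    # one line per month, events separated by " | ".
    by_month = defaultdict(list)
    for r in records:
        month = (r["valid_from"] or "undated")[:7]
        by_month[month].append(
            f'{r["subject"]} {r["predicate"]} {r["object"]}'
        )
    return "\n".join(
        f"{month}  {' | '.join(events)}"
        for month, events in sorted(by_month.items())
    )

text = format_timeline([
    {"subject": "Kai", "predicate": "joined", "object": "Driftwood",
     "valid_from": "2024-10-01"},
    {"subject": "Soren", "predicate": "joined", "object": "Driftwood",
     "valid_from": "2024-10-15"},
])
# → "2024-10  Kai joined Driftwood | Soren joined Driftwood"
```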
Application Scenario: New-Hire Onboarding
The most direct application of timeline narration is new-hire onboarding. When a new team member joins, they are not facing a freshly built, blank project -- they are facing a system with history, backed by countless discussions, decisions, reversals, and restarts.
Traditional onboarding relies on two information sources: documentation and oral knowledge transfer. Documentation is often outdated (last updated six months ago), and oral transfer is fragmented (depending on who you asked and what questions you posed).
Timeline narration provides a third pathway: automatically generated, chronologically arranged project history from the knowledge graph.
An Onboarding Scenario
Leo is a new member of the Driftwood team, joining in March 2026. He knows nothing about the project. His AI assistant calls mempalace_kg_timeline via MCP:
> Tell me the history of the Driftwood project
Calling mempalace_kg_timeline("Driftwood")
Driftwood project timeline:
Sep 2024 - Priya created the project and assumed the manager role
Oct 2024 - Kai joined (backend), team chose PostgreSQL
Mid-Oct 2024 - Soren joined (frontend)
Nov 2024 - Maya joined (infrastructure)
Jan 2026 - Kai recommended Clerk over Auth0 (reasons: pricing and DX)
Jan 15, 2026 - Team adopted Clerk, Maya assigned auth migration
Feb 1, 2026 - Maya completed auth migration
Mar 2026 - You (Leo) joined the team
This response gives Leo a complete picture of the project in seconds. He knows who created the project, how the team was assembled, what key technical decisions were made, and why those decisions were made.
More importantly, he can continue probing for details. The timeline is an entry point, not an endpoint:
> Why did Kai recommend Clerk over Auth0?
Calling mempalace_search("Clerk Auth0 decision")
From conversation records in January 2026:
"Kai compared the pricing models of Clerk and Auth0 and found
Clerk to be 40% cheaper at our expected user scale. Additionally,
Clerk's SDK documentation is better and integration tests are
easier to write. The team unanimously agreed to adopt Clerk
at the January 15 meeting."
The timeline provides the overview, and the palace structure's semantic search can provide fuller context. In the design vocabulary, source_closet is the field that can tie those layers together; in the current public repository, it is safer to read that as a provenance hook than as a guarantee that every timeline fact already comes with a fully populated closet-to-verbatim trace.
Comparison with Traditional Onboarding
| Dimension | Traditional Onboarding | Timeline Narration |
|---|---|---|
| Information source | Documentation + oral transfer | Auto-generated from knowledge graph |
| Timeliness | Depends on manual updates | Updates in real-time with the knowledge graph |
| Completeness | Depends on the documentation maintainer's diligence | Covers all recorded facts |
| Interactivity | Static documents | Can probe for details |
| Personalization | Generic documents, no reader differentiation | Can filter by specific role or focus area |
Timeline narration cannot fully replace traditional onboarding. Some knowledge -- team culture, communication style, informal work norms -- is not well-suited for encoding as triples. But for the specific dimension of "a project's technical decision history," timeline narration provides a more reliable approach than either documentation or oral transfer.
Limitations of the Timeline
The timeline() method is deliberately minimalist in design, and this minimalism brings some limitations.
No causal relationships. The timeline is merely a chronologically sorted list of facts. It cannot tell you the causal relationships between facts -- "Kai recommended Clerk" and "the team adopted Clerk" are adjacent in time, and a human reader can infer the former led to the latter, but the timeline data itself does not encode this relationship.
Causal relationships require more complex knowledge representation -- such as caused_by or led_to relationships between events. This exceeds the scope of the current temporal triple model. But from another angle, having an LLM infer causal relationships from chronologically arranged facts is precisely what LLMs excel at. The timeline provides the raw material; the LLM handles the narration.
No importance ranking. All facts are treated equally. "Priya created Driftwood" and "Driftwood uses PostgreSQL" occupy the same position in the timeline, but the former is clearly more important to the project narrative than the latter.
A possible improvement would be introducing importance markers (potentially leveraging the confidence field or adding a new importance field) to allow timeline filtering by importance. But this introduces the question of "who judges importance" -- which creates tension with MemPalace's core philosophy of "not letting AI decide what is important."
The 100-record limit on global timelines. LIMIT 100 is a hard-coded safety valve. For small-scale knowledge graphs this is sufficient, but if the knowledge graph grows to thousands of triples, 100 records may only cover very early history. An improvement would be supporting paginated queries (providing offset and limit parameters) or supporting time range filtering (viewing only the last 6 months of the timeline).
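One shape such an improvement could take is sketched below, paginating in Python over the list timeline() already returns. This is an assumption about the interface, not existing MemPalace code; a real fix would push OFFSET/LIMIT and a date predicate into the SQL itself.

```python
def paginate_timeline(records, offset=0, limit=100, since=None):
    # Hypothetical caller-side pagination and time-range filtering
    # for timeline() output. 'since' keeps only dated facts whose
    # valid_from is on or after the given ISO date string.
    if since is not None:
        records = [
            r for r in records
            if r["valid_from"] and r["valid_from"] >= since
        ]
    return records[offset : offset + limit]

facts = [{"valid_from": f"2026-01-{d:02d}"} for d in range(1, 11)]
page = paginate_timeline(facts, offset=2, limit=3)      # days 03-05
recent = paginate_timeline(facts, since="2026-01-08")   # days 08-10
```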
These limitations are all deliberate design tradeoffs. MemPalace's timeline feature is positioned as "good enough" -- providing sufficient structure for an LLM to generate readable narratives, without trying to become a complete timeline analysis tool. Complex causal reasoning, importance judgment, and cross-period trend analysis are better handled at the LLM level; the timeline only needs to provide clean, chronologically sorted structured data as input.
The Relationship Between Timeline and Palace Structure
Timeline narration is a microcosm of collaboration among MemPalace's three subsystems.
The knowledge graph provides structured facts and temporal information. The timeline() method obtains sorted triple lists from here.
The palace structure provides context and original memories. Through the source_closet field, each fact in the timeline can be traced back to a closet in the palace, and from the closet back to the verbatim content in a drawer.
The AAAK dialect provides compression capability. When timelines need to be loaded as part of AI context, AAAK can compress a complete project chronicle into very few tokens.
This collaboration is natural and seamless. The three subsystems each do their own job well -- the knowledge graph manages facts, the palace structure manages memories, AAAK manages compression -- and then cooperate through simple interfaces. No subsystem depends on another's internal implementation details.
This low-coupling, high-collaboration architectural style is one reason MemPalace can achieve rich functionality with only two external dependencies.
Summary
The timeline() method implements the transformation from discrete triples to a chronicle in fewer than 40 lines of code (knowledge_graph.py:274-311). Its design follows MemPalace's consistent philosophy: simple data structures, clear query interfaces, leaving complex presentation and reasoning to the LLM.
Chronological sorting is the most basic narrative structure. The single SQL line ORDER BY t.valid_from ASC transforms a pile of unordered facts into a story arc. Bidirectional matching (subject = ? OR object = ?) ensures the story's completeness. NULLS LAST places facts without time information at the end, preventing them from disrupting the chronicle's main thread.
This is not a complex feature. But it does not need to be. It only needs to provide good enough raw material, letting the LLM complete the final mile of "storytelling." This division of labor -- "infrastructure does simple things, the intelligence layer does complex things" -- is a pattern that recurs throughout MemPalace's architecture.
With this, Part 4, "The Time Dimension," concludes. We have seen how the temporal knowledge graph gives facts lifecycles (Chapter 11), how it leverages temporal information to detect contradictions (Chapter 12), and how it weaves discrete temporal facts into a readable narrative (Chapter 13). Together, these three chapters demonstrate a core insight: in AI memory systems, time is not metadata -- time is the data itself.
Chapter 14: L0-L3 -- The Layered Design of the Four-Tier Memory Stack
Positioning: This chapter dissects the design rationale and implementation of MemPalace's four-tier memory stack. We will analyze layer by layer what problem each solves, how many tokens it consumes, when it loads, and why four layers rather than some other number. Source code analysis is based on mempalace/layers.py.
A Question About "Waking Up"
When your AI assistant starts a fresh session from scratch, it knows nothing about you. It does not know your name, does not know what project you are working on, does not know you made a critical architectural decision yesterday. Every new session begins in complete amnesia.
The naive solution to this problem is: stuff all conversation history into the context window. But as analyzed in earlier chapters of this book, six months of daily AI use produces approximately 19.5 million tokens -- far exceeding any model's context window. Even if context windows expand to 100 million tokens in the future, this brute-force loading approach has a fundamental cost problem: every conversation would bill for millions of tokens, 99% of which are useless in the current conversation.
Another common approach is having the LLM extract "important information" after each conversation and save it as summaries. But as we have repeatedly discussed, this method introduces irreversible information loss at the storage stage.
MemPalace's answer is a four-tier memory stack: not storing more, not storing less, but loading the right amount at the right time.
Why "Stack" Rather Than "Database"
Before discussing the specific layers, it is worth understanding why MemPalace chose the "stack" metaphor.
A traditional database is flat: all data lives on the same level, extracted on demand via query language. But human memory does not work this way. You do not need to recall your own name -- that information is always on the surface of consciousness; you do not need to deliberately think about what you did this morning -- these recent experiences are readily available in "working memory"; but if someone asks about details of a trip three years ago, you need to actively "search" long-term memory.
This layering is not accidental. Cognitive science divides human memory into sensory memory, short-term (working) memory, and long-term memory, each with different capacity, duration, and retrieval cost. MemPalace's four-tier stack directly maps this cognitive structure -- not because biomimicry is the goal, but because this layering happens to solve practical engineering problems in AI context management.
The core question is: how to maximize information utility within a limited token budget?
The answer is to layer by frequency and urgency. Some information is needed in every conversation ("who I am"), some only when a specific topic comes up ("recent discussions about this project"), and some only when explicitly asked ("the discussion about GraphQL last March"). Treating them all at the same level is either too expensive, too slow, or both.
The Four-Tier Overview
Before diving into the details of each layer, here is the complete stack structure:
| Layer | Content | Typical Size | Load Timing | Design Motivation |
|---|---|---|---|---|
| L0 | Identity -- "Who am I" | ~50-100 tokens | Always loaded | AI needs to know its role and basic relationships |
| L1 | Key facts -- the most important memories | ~500-800 tokens (current implementation) | Always loaded | Minimum viable context: team, project, core preferences |
| L2 | Room recall -- on-demand retrieval | ~200-500 tokens/retrieval | On explicit recall() | A batch of relevant context for the current topic |
| L3 | Deep search -- semantic retrieval | Unlimited | On explicit query | Full semantic search across all data |
graph TB
subgraph "Always Loaded"
L0["L0 Identity<br/>~50 tokens"]
L1["L1 Key Facts<br/>~500-800 tokens (current)"]
end
subgraph "On-Demand Loading"
L2["L2 Room Recall<br/>~200-500 tokens"]
L3["L3 Deep Search<br/>Unlimited"]
end
L0 --> L1
L1 -.->|explicit recall by higher layer| L2
L2 -.->|explicit semantic query| L3
In the current v3.0.0 source baseline, L0 + L1 produce a wake-up cost of roughly 600-900 tokens. The README also presents a more aggressive target figure: if AAAK is fully connected to the wake-up path, L0 + L1 can be pushed to about 170 tokens. The two should not be conflated. In the rest of this chapter, discussion of layers.py reflects the 600-900-token current state; discussion of longer-term compression direction refers to the README's 170-token target.
That means: even without AAAK on the default path, MemPalace's wake-up remains a relatively cheap persistent context layer; the README's 170-token version represents the upper bound once the compression path is fully wired through.
L0: The Identity Layer
Layer 0: Identity (~100 tokens) -- Always loaded. "Who am I?"
L0 is the simplest layer in the entire stack and also the most indispensable. It answers a fundamental question: who is this AI assistant?
In the layers.py implementation, the Layer0 class reads identity information from a plain text file (layers.py:34-69):
class Layer0:
"""
~100 tokens. Always loaded.
Reads from ~/.mempalace/identity.txt -- a plain-text file the user writes.
"""
def __init__(self, identity_path: str = None):
if identity_path is None:
identity_path = os.path.expanduser("~/.mempalace/identity.txt")
self.path = identity_path
self._text = None
def render(self) -> str:
if self._text is not None:
return self._text
if os.path.exists(self.path):
with open(self.path, "r") as f:
self._text = f.read().strip()
else:
self._text = (
"## L0 -- IDENTITY\n"
"No identity configured. Create ~/.mempalace/identity.txt"
)
return self._text
Several design choices are worth noting.
Plain text, user-written. Identity is not automatically extracted from conversations but written by the user themselves. This is a deliberate decision. Identity is declarative knowledge -- "I am Atlas, Alice's personal AI assistant" -- that does not need to be mined from massive conversations. Having users define their own identity means identity is always precise, intentional, and controllable.
Filesystem, not database. L0 reads from ~/.mempalace/identity.txt -- an ordinary text file editable with any text editor. This eliminates all the complexity of "how to update identity." Want to change the identity? Edit the file.
Cached reading. The render() method uses _text for simple caching (layers.py:52-65). The file is read only once; afterward, the cached content is returned directly. This is sufficient for L0 -- identity does not change during a session.
Graceful degradation. If the identity file does not exist, L0 does not error out but returns prompt text guiding the user to create the file (layers.py:61-63). The system can always start, regardless of whether configuration is complete.
Token estimation. The token_estimate() method uses a simple heuristic: character count divided by 4 (layers.py:67-68). This is not a precise tokenizer calculation but a good-enough approximation. At L0's scale (typically a few dozen to a hundred tokens), this precision is perfectly acceptable.
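The heuristic is a one-liner; as a sketch of the described behavior:

```python
def token_estimate(text: str) -> int:
    # The chars/4 heuristic: cheap, tokenizer-free, and accurate
    # enough at L0's scale (tens to hundreds of tokens).
    return len(text) // 4

identity = "I am Atlas, a personal AI assistant for Alice."
est = token_estimate(identity)  # 46 characters -> 11 estimated tokens
```

For English prose, real BPE tokenizers average roughly 3-5 characters per token, so dividing by 4 sits comfortably in the right order of magnitude, which is all a budget check needs.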
A typical identity.txt looks roughly like this:
I am Atlas, a personal AI assistant for Alice.
Traits: warm, direct, remembers everything.
People: Alice (creator), Bob (Alice's partner).
Project: A journaling app that helps people process emotions.
This is approximately 50 tokens. It looks trivial, but it gives the AI a crucial anchor: it knows "who" it is, "whom" it serves, and what its behavioral style should be. Without this anchor, every conversation would need to begin by re-establishing the relationship from "Hello, I am your AI assistant."
L1: The Key Facts Layer
Layer 1: Essential Story (~500-800 tokens) -- Always loaded. Top moments from the palace.
If L0 is "who am I," L1 is "what are the most important things I know."
L1's design goal is to load the core facts most likely to be useful in the current conversation within the smallest possible token budget. It does not need to contain all memories -- that is L3's job -- but rather provides a "minimum viable context" that allows the AI to appear as if it "remembers you" without any active searching.
In layers.py:76-168, the Layer1 class implementation reveals several key design decisions:
Auto-generated, not manually maintained. Unlike L0, L1 does not require the user to write it by hand. It automatically extracts the most important memory fragments from ChromaDB's palace data (layers.py:91-168).
Importance ranking. L1 uses a scoring mechanism to decide which memories are most worth loading. The scoring logic is at layers.py:116-128:
scored = []
for doc, meta in zip(docs, metas):
importance = 3
for key in ("importance", "emotional_weight", "weight"):
val = meta.get(key)
if val is not None:
try:
importance = float(val)
except (ValueError, TypeError):
pass
break
scored.append((importance, meta, doc))
scored.sort(key=lambda x: x[0], reverse=True)
top = scored[: self.MAX_DRAWERS]
The code tries to read importance scores from multiple metadata keys -- importance, emotional_weight, weight -- reflecting a pragmatic compatibility strategy: data from different sources may use different key names to mark importance, and L1 tries each in sequence, using the first valid value found. The default value is 3 (moderate importance), ensuring that even without explicit marking, memories can participate in ranking.
Grouped by room. The top N memories after sorting are not simply listed in a flat list but grouped by room for display (layers.py:135-139):
by_room = defaultdict(list)
for imp, meta, doc in top:
room = meta.get("room", "general")
by_room[room].append((imp, meta, doc))
This design gives L1's output structure -- the AI does not see a jumble of scattered facts but information organized by topic. This aligns with the memory palace's core concept: spatial structure itself is the index.
Hard token limits. L1 has two hard constraints: maximum 15 memories (MAX_DRAWERS = 15), and total characters not exceeding 3200 (MAX_CHARS = 3200, approximately 800 tokens). When approaching the limit, the generation process gracefully truncates and adds a "... (more in L3 search)" hint telling the AI it can get more through deep search (layers.py:160-163).
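A simplified sketch of that truncation behavior follows. The MAX_DRAWERS/MAX_CHARS values are those cited above, but the function itself is illustrative: the real Layer1 also scores by importance and groups by room before rendering.

```python
MAX_DRAWERS = 15
MAX_CHARS = 3200  # ~800 tokens at the chars/4 heuristic

def render_l1(snippets):
    # Take at most MAX_DRAWERS snippets, stop before exceeding
    # MAX_CHARS, and hint that more is available via L3 search.
    lines, used, truncated = [], 0, False
    for snippet in snippets[:MAX_DRAWERS]:
        if used + len(snippet) > MAX_CHARS:
            truncated = True
            break
        lines.append(snippet)
        used += len(snippet)
    if truncated or len(snippets) > MAX_DRAWERS:
        lines.append("... (more in L3 search)")
    return "\n".join(lines)

out = render_l1(["x" * 300] * 20)  # only 10 snippets fit under 3200 chars
```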
Why ~500-800 tokens in the current implementation, and ~120 in the README's AAAK target? This range was not chosen arbitrarily. The README indicates a design goal of keeping total wake-up cost (L0 + L1) around 170 tokens. In that future path, AAAK would compress L1 to roughly 120 tokens while preserving a compact representation of team members, current projects, key decisions, and core preferences. Without AAAK on the current default path, the same information volume occupies 500-800 tokens, which is still manageable.
The budget was derived backwards from capability: first define what the AI should be able to answer immediately after waking up (who you are, who your team is, what project you are working on, what important decisions and recent high-weight memories matter), then estimate the minimum information needed to support that.
L2: The On-Demand Retrieval Layer
Layer 2: On-Demand (~200-500 tokens each) -- Loaded on explicit recall.
L2 is the middle ground between "passive memory" and "active search."
L0 and L1 are always present, forming the AI's "resident awareness." L3 is deep search, requiring explicit queries. L2 sits between the two: a lightweight filtered recall path that a higher layer can call once it already knows which wing / room is relevant.
In layers.py:176-233, Layer2's implementation is quite straightforward:
class Layer2:
"""
~200-500 tokens per retrieval.
Loaded when a higher layer explicitly asks for a specific wing / room.
Queries ChromaDB with a wing/room filter.
"""
def retrieve(self, wing: str = None, room: str = None,
n_results: int = 10) -> str:
L2's core mechanism is filtering rather than searching. It does not use semantic queries but narrows scope through metadata filtering (wing and room) (layers.py:195-205):
where = {}
if wing and room:
where = {"$and": [{"wing": wing}, {"room": room}]}
elif wing:
where = {"wing": wing}
elif room:
where = {"room": room}
kwargs = {"include": ["documents", "metadatas"], "limit": n_results}
if where:
kwargs["where"] = where
results = col.get(**kwargs)
Note this uses col.get() rather than col.query(). get() is ChromaDB's metadata filtering method, involving no vector similarity computation -- it simply returns documents matching the conditions. This means L2 retrieval is deterministic and has zero semantic overhead. Once the higher layer knows it wants wing=driftwood, L2 does not need to do semantic understanding; it only needs to fetch the matching records.
Why 200-500 tokens? This range corresponds to a small batch of room- or wing-filtered memory fragments. Each fragment is truncated to under 300 characters (layers.py:226-228), and with metadata tags, the total stays around one or two paragraphs. It is enough for the AI to reload a narrow local slice of context without crowding out the conversation itself.
L2 solves a subtle but important orchestration problem: without it, a higher layer must choose between keeping only shallow L1 awareness or jumping straight to full semantic search every time it already knows the relevant wing / room. In the current version, that benefit appears as an explicit API, not as automatic topic listening.
L3: The Deep Search Layer
Layer 3: Deep Search (unlimited) -- Full ChromaDB semantic search.
L3 is the only layer that uses semantic search.
The previous three layers all perform "pre-loading" -- based on rules and structure, automatically injecting relevant information before conversation begins or when topics switch. L3 is different: it is an on-demand, full-corpus search, used to answer questions that require retrieval across the entire memory store.
Layer3's core method search() is at layers.py:251-303:
class Layer3:
"""
Unlimited depth. Semantic search against the full palace.
"""
def search(self, query: str, wing: str = None,
room: str = None, n_results: int = 5) -> str:
# ...
kwargs = {
"query_texts": [query],
"n_results": n_results,
"include": ["documents", "metadatas", "distances"],
}
if where:
kwargs["where"] = where
results = col.query(**kwargs)
Here col.query() is used -- ChromaDB's semantic search method. It converts the query text into a vector, ranks the entire collection by cosine similarity, and returns the closest results.
L3's output format design is also noteworthy (layers.py:287-303):
lines = [f'## L3 -- SEARCH RESULTS for "{query}"']
for i, (doc, meta, dist) in enumerate(zip(docs, metas, dists), 1):
similarity = round(1 - dist, 3)
wing_name = meta.get("wing", "?")
room_name = meta.get("room", "?")
# ...
lines.append(f" [{i}] {wing_name}/{room_name} (sim={similarity})")
lines.append(f" {snippet}")
Each result includes three types of information: location (wing/room), similarity score, and content snippet. Location information tells the AI "where in the palace" this memory lives, the similarity score lets the AI judge the result's reliability, and the content snippet provides the actual information.
Relationship with searcher.py. The L3 implementation in layers.py and the search functionality in searcher.py are logically overlapping. searcher.py provides two functions: search() (prints formatted output, searcher.py:15-84) and search_memories() (returns structured data, searcher.py:87-142). Both use the same ChromaDB query() call, differing only in output format -- the former for CLI, the latter for programmatic calls such as the MCP server.
L3 also provides a search_raw() method (layers.py:305-352) that returns raw dictionary lists instead of formatted text. This provides a flexible data interface for upper-layer applications (such as MCP tools).
Unified Interface: MemoryStack
The four layers are exposed through a unified MemoryStack class (layers.py:360-438):
class MemoryStack:
    def __init__(self, palace_path=None, identity_path=None):
        self.l0 = Layer0(self.identity_path)
        self.l1 = Layer1(self.palace_path)
        self.l2 = Layer2(self.palace_path)
        self.l3 = Layer3(self.palace_path)

    def wake_up(self, wing=None) -> str:
        """L0 (identity) + L1 (essential story). ~600-900 tokens."""
        parts = []
        parts.append(self.l0.render())
        parts.append("")
        if wing:
            self.l1.wing = wing
        parts.append(self.l1.generate())
        return "\n".join(parts)

    def recall(self, wing=None, room=None, n_results=10) -> str:
        """On-demand L2 retrieval."""
        return self.l2.retrieve(wing=wing, room=room, n_results=n_results)

    def search(self, query, wing=None, room=None, n_results=5) -> str:
        """Deep L3 semantic search."""
        return self.l3.search(query, wing=wing, room=room, n_results=n_results)
Three methods, three usage scenarios:
- wake_up(): Called once at the start of each session. Injected into the system prompt or first message.
- recall(): Called on topic changes. When the conversation involves a specific project or domain, loads related memories.
- search(): Called on explicit user questions. Semantic retrieval across all data.
The wake_up() method also supports a wing parameter (layers.py:380-399), allowing L1 content to be filtered by project. If you are working on Driftwood, wake_up(wing="driftwood") loads only key facts related to Driftwood, further reducing token consumption while improving information relevance.
The status() method (layers.py:409-438) provides overall stack diagnostics, including whether the identity file exists, token estimates, and total memory count. This is useful for debugging and operations.
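The three-method call pattern can be sketched with stub layers. This is a toy illustration of the dispatch, not MemPalace's implementation; the stub classes and the wing-filtering logic here are invented for the example.

```python
# Minimal sketch of the MemoryStack call pattern, with stub layers standing in
# for the real Layer0/Layer1 classes. The stubs and filter are illustrative.

class StubLayer:
    def __init__(self, text):
        self.text = text

    def render(self):
        return self.text

class MiniStack:
    def __init__(self):
        self.l0 = StubLayer("# IDENTITY\nname: Ada")
        self.l1 = StubLayer("## L1 -- KEY FACTS\n- current project: driftwood")

    def wake_up(self, wing=None):
        """Always-loaded context: identity + key facts (optionally wing-filtered)."""
        parts = [self.l0.render(), ""]
        facts = self.l1.render()
        if wing:  # toy filter: keep only fact lines mentioning the wing
            kept = [ln for ln in facts.splitlines()
                    if not ln.startswith("-") or wing in ln]
            facts = "\n".join(kept)
        parts.append(facts)
        return "\n".join(parts)

stack = MiniStack()
print(stack.wake_up(wing="driftwood"))
```

The point of the pattern is that wake_up() stays small no matter how much memory exists below it; everything else is loaded on demand.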
Why Four Layers, Not Two or Eight
This is a design question worth answering seriously.
Why not two layers (always-loaded + search)? Because an important gray area exists between "always loaded" and "searched on demand." Imagine a system with only L0+L1 and L3: your AI knows your name and current project (L1) and can search when asked specific questions (L3), but when the conversation is already known to be about Driftwood, it must still fall back to full semantic search just to pull in project context. L2 fills that gap by making wing/room recall a separate lightweight path. In the current version, the gain appears as an explicit API rather than automatic topic listening.
Why not three layers (dropping L0, merging identity into L1)? Because identity and facts are fundamentally different in nature. Identity is declarative, user-controlled, and almost never changes. L1's key facts are auto-generated from data, ranked by importance, and updated as new data arrives. Mixing them together means either user identity declarations get crowded out by auto-generated content, or auto-generation logic must carefully avoid the user-written portions. Separating them is cleaner.
Why not more layers? Because each additional layer adds a "when to load" decision point. Four layers already cover the critical timing semantics: always (L0, L1), filtered recall (L2), and explicit search (L3). It is hard to define a fifth meaningful loading trigger. If you try to split L2 into "recent topics" and "older topics," or split L3 into "shallow search" and "deep search," the complexity introduced would likely exceed the benefit.
Four layers is a minimum complete set: one fewer means missing functionality, one more means over-engineering.
The Economics of Token Budgets
Finally, let us do the math.
If you calculate strictly from today's layers.py, MemPalace's default wake-up cost is ~600-900 tokens, not 170. The README's ~170 token / ~$0.70 per year figure represents the target economics after AAAK is connected to the main wake-up path, not the current measured output of mempalace wake-up.
But that does not change the important principle here: the best cost optimization is not making each call cheaper, but making most calls not happen at all.
Whether today's cost is 600-900 tokens or the README's future 170-token path, L0 + L1 are not trying to stuff all history into every conversation. They select only the small amount worth keeping resident all the time. Through four-layer separation, the vast majority of memories never need to be loaded into any single conversation. They sit quietly in ChromaDB, appearing through L2 or L3 only when explicitly needed.
That is the value of layering: not storing less, but loading the right amount at the right time. The full six months of memory are still there, with nothing deleted. In today's implementation, the AI wakes up with 600-900 tokens; on the README's compression roadmap, it could eventually do the same job for less.
The stack is not a filing cabinet. It is a layered decision system about "what is worth remembering right now."
Chapter 15: Hybrid Retrieval -- From 96.6% to 100%
Positioning: This chapter analyzes how MemPalace leaped from 96.6% R@5 with pure vector retrieval to 100% (500/500) with hybrid mode. We will dissect the failure case types in the 3.4%, the step-by-step technical improvement path, the cost and rationale of Haiku reranking, and why 100% should not be simply equated with "forever perfect."
What 96.6% Means
Of the 500 questions in the LongMemEval benchmark, MemPalace's pure ChromaDB mode -- calling no external APIs, using no LLM, running entirely locally -- hit 483. This is a number that needs to be understood in context.
LongMemEval is a standardized AI memory benchmark containing six question types: knowledge update, multi-session reasoning, temporal reasoning, single-session user questions, single-session preference questions, and single-session assistant questions. R@5 (Recall at 5) means: does the correct answer exist within the system's top 5 returned results? 96.6% means that of 500 questions, only 17 had the correct answer outside the top 5 retrieval results.
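The metric itself is simple to compute. A sketch with made-up rankings and gold answers (the session ids and the three toy questions are invented for illustration):

```python
def recall_at_k(ranked_ids, correct_id, k=5):
    """1 if the correct document appears in the top-k results, else 0."""
    return int(correct_id in ranked_ids[:k])

# Toy benchmark: 3 questions, each a (ranked result list, gold id) pair.
results = [
    (["s7", "s2", "s9", "s1", "s4"], "s2"),  # hit at rank 2
    (["s3", "s8", "s5", "s6", "s0"], "s6"),  # hit at rank 4
    (["s1", "s2", "s3", "s4", "s5"], "s9"),  # miss: gold outside top 5
]
score = sum(recall_at_k(ranked, gold) for ranked, gold in results) / len(results)
print(f"R@5 = {score:.1%}")  # 2 hits out of 3
```

Scaled to LongMemEval, 483 hits out of 500 questions gives exactly the 96.6% figure.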
The system that achieved this score depended on exactly one component: ChromaDB's default embedding model (all-MiniLM-L6-v2). No post-processing, no reranking, no "intelligent extraction." Store raw text, embed, sort by cosine similarity, return.
As BENCHMARKS.md states:
Nobody published this result because nobody tried the simple thing and measured it properly.
This statement points to a deeper discovery: the entire AI memory field has been over-engineering at the storage stage. When Mem0 uses an LLM to extract "user prefers PostgreSQL" and discards the original conversation, when Mastra uses GPT to observe conversations and generate summaries, they are all introducing irreversible information loss at the storage stage. MemPalace proved a counterintuitive fact: preserving raw text and relying on a good embedding model for retrieval is already an extremely powerful baseline.
But 96.6% is not 100%. What do those 17 failed questions tell us?
3.4%: Anatomy of the Failure Cases
Analysis of the failure cases reveals several clear patterns. These patterns are not random -- they point to systematic blind spots in vector retrieval.
Type 1: Embedding Model Underweights Specific Nouns
HYBRID_MODE.md documents typical cases:
- "What degree did I graduate with?" The correct answer is "Business Administration." The embedding model treats "Business Administration" and "Computer Science" as semantically equally close to "what degree" -- both are degree names and are close in embedding space. But only one document simultaneously contains both "degree" and "Business Administration."
- "What kitchen appliance did I buy?" The correct answer is "stand mixer." "Kitchen appliance" is a broad semantic region in embedding space with many related documents. But "stand mixer" as a specific noun appears in only one specific document.
- "Where did I study abroad?" The correct answer is "Melbourne." City names have their embedding signal diluted when surrounded by abundant contextual vocabulary.
Common characteristic: the correct answer depends on a specific noun or phrase, and the embedding model tends to capture "semantic similarity" rather than "exact match." When multiple documents are semantically related to the query, the embedding model cannot distinguish which contains the specific answer word.
Type 2: Temporal Anchors Ignored by Embeddings
"What was the significant business milestone I mentioned four weeks ago?" This type of question contains a temporal anchor -- "four weeks ago." The embedding model does not process temporal information at all. It does not know which date "four weeks ago" corresponds to, nor can it adjust ranking based on document timestamps. The correct document is indeed semantically related to the query (it is about a "business milestone"), but within the top-50 semantic results, its ranking is not high enough because the temporal signal is ignored.
Type 3: Indirect Expression of Preferences
"What database do I prefer?" This type of question is related in embedding space to many documents involving databases. But the way users express preferences is often indirect -- "I find Postgres more reliable in my experience" or "I usually go with Postgres for new projects." The embedding model interprets these sentences as "statements about Postgres" rather than "expressions of preference." When the top-5 results include other more "semantically close" database discussion documents, the document actually containing the preference may rank 6th or 7th.
Type 4: References to Assistant Responses
"You suggested X, can you remind me..." This type of question refers to something the AI assistant said, not the user. But the standard index only stores user utterances. Assistant responses are not in the search scope and naturally cannot match.
From 96.6% to 100%: A Five-Step Leap
MemPalace's improvement path was a series of targeted fixes for specific failure patterns, not speculative generalized optimization. Each step responds to one of the failure types analyzed above. BENCHMARKS.md documents the complete evolution trajectory.
graph LR
V0["Raw<br/>96.6%"] --> V1["v1<br/>97.8%"]
V1 --> V2["v2<br/>98.4%"]
V2 --> V2R["v2+rerank<br/>98.8%"]
V2R --> V3["v3+rerank<br/>99.4%"]
V3 --> V4["v4+rerank<br/>100%"]
style V4 fill:#2d5,color:#fff
Step 1: Hybrid Scoring v1 (96.6% -> 97.8%)
Problem addressed: Type 1 -- specific nouns underweighted by embeddings.
Method: Layer keyword overlap scoring on top of embedding similarity. Extract meaningful keywords from the query (removing stopwords), calculate the matching proportion of keywords in each candidate document, and use this proportion to adjust the distance score.
HYBRID_MODE.md documents the fusion formula:
fused_dist = dist * (1.0 - 0.30 * overlap)
- dist: ChromaDB's cosine distance (lower is better)
- overlap: proportion of query keywords found in the document (0.0 to 1.0)
- 0.30: boost weight -- up to 30% distance reduction
A concrete example: Document A has semantic distance 0.45 with keyword overlap of 0; Document B has semantic distance 0.52 but complete keyword match. After fusion, A's score remains 0.450, while B becomes 0.364, flipping from behind to ahead in ranking.
The key design choice is expanding the candidate pool: from top-10 to top-50. A larger candidate pool gives keyword reranking more working room -- if the correct answer is at semantic rank 45 but has complete keyword match, it needs to be in the pool to have a chance of being promoted.
Why 30% and not higher? HYBRID_MODE.md explains the weight tuning process. In the full 500-question test, 0.30 and 0.40 performed essentially the same; above 0.40, signs of overfitting began appearing (looking better on a 100-question subset but no improvement on the full 500). 30% is enough to flip edge cases without being so strong as to override clearly better semantic results.
The stopword list itself was also carefully designed:
STOP_WORDS = {
    "what", "when", "where", "who", "how", "which", "did", "do",
    "was", "were", "have", "has", "had", "is", "are", "the", "a",
    "an", "my", "me", "i", "you", "your", ...
}
Only words longer than 3 characters and not in the stopword list are treated as keywords. This filters out function words from questions while preserving content words with retrieval value.
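Putting the fusion formula and the keyword filter together, the whole v1 mechanism fits in a few lines. A sketch, assuming a shortened illustrative stopword list and made-up documents; the 0.30 boost weight and the fused_dist formula are the ones quoted from HYBRID_MODE.md above:

```python
# Sketch of keyword-overlap fusion. STOP_WORDS here is a shortened
# illustrative subset; the documents are invented for the example.
STOP_WORDS = {"what", "did", "with", "the", "a", "my", "i", "you", "your"}

def keywords(text):
    """Content words: longer than 3 characters and not stopwords."""
    return {w for w in text.lower().split() if len(w) > 3 and w not in STOP_WORDS}

def fuse(dist, query, doc, boost=0.30):
    """Reduce cosine distance by up to `boost` based on keyword overlap."""
    q = keywords(query)
    overlap = len(q & keywords(doc)) / len(q) if q else 0.0
    return dist * (1.0 - boost * overlap)

query = "what degree did i graduate with"
# Document A: semantically close, zero keyword overlap.
# Document B: slightly farther, but contains both content words.
a = fuse(0.45, query, "we talked about computer science courses")
b = fuse(0.52, query, "i graduate next spring with a business degree")
print(a, b)  # A stays at 0.45; B drops to 0.364 and overtakes A
```

This reproduces the worked example from the text: a full keyword match turns 0.52 into 0.364, flipping the ranking without touching the embedding model.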
Step 2: Hybrid Scoring v2 (97.8% -> 98.4%)
Problem addressed: Type 2 -- temporal anchors ignored.
Method: For questions containing temporal references ("four weeks ago," "last month," "recently"), calculate the date distance between each candidate document and the target date, giving temporally proximate documents an additional score boost.
days_diff = abs((session_date - target_date).days)
temporal_boost = max(0.0, 0.40 * (1.0 - days_diff / window_days))
fused_dist = fused_dist * (1.0 - temporal_boost)
A maximum 40% temporal boost -- enough to push temporally correct documents to the front, but not enough to completely override the semantic signal. HYBRID_MODE.md specifically explains why 100% is not used: temporal proximity is a strong signal but not a deterministic one; it is a "hint" not a "rule."
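The formula quoted above can be walked through numerically. A sketch, assuming an illustrative 28-day window for "four weeks ago" and made-up dates (the 0.40 cap is from the formula; the window value is my assumption):

```python
# Worked example of the temporal boost formula. Dates and the 28-day window
# are illustrative; the 0.40 cap matches the formula quoted above.
from datetime import date

def temporal_fuse(fused_dist, session_date, target_date, window_days=28):
    days_diff = abs((session_date - target_date).days)
    temporal_boost = max(0.0, 0.40 * (1.0 - days_diff / window_days))
    return fused_dist * (1.0 - temporal_boost)

target = date(2026, 1, 1)  # "four weeks ago" resolved to a concrete date
on_target = temporal_fuse(0.50, date(2026, 1, 1), target)  # full 0.40 boost
week_off = temporal_fuse(0.50, date(2026, 1, 8), target)   # boost = 0.30
far_away = temporal_fuse(0.50, date(2026, 3, 1), target)   # boost clamped to 0
print(on_target, week_off, far_away)
```

A session on the target date gets the full 40% reduction (0.50 to 0.30); one a week off gets 30%; anything outside the window is left untouched, so the semantic signal still decides.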
Additionally, v2 introduced a two-round retrieval mechanism to handle Type 4 (assistant reference questions): the first round uses the user-only index to find the 5 most likely sessions, then the second round builds a new index for those 5 sessions that includes assistant responses and queries again. This "coarse-then-fine" strategy avoids the semantic signal dilution that comes from globally indexing assistant responses.
The key characteristic of v2: zero LLM calls. All improvements are based on string matching and date arithmetic -- completed entirely locally, requiring no API key and no network connection.
Step 3: Hybrid Scoring v2 + Haiku Reranking (98.4% -> 98.8%)
This is the system's first introduction of an LLM.
The llm_rerank() function implemented at longmemeval_bench.py:2765-2860 reveals the complete reranking mechanism:
def llm_rerank(question, rankings, corpus, corpus_ids, api_key,
               top_k=10, model="claude-haiku-4-5-20251001"):
The workflow is extremely concise: take the top-K candidate documents from retrieval, truncate each to the first 500 characters, send them along with the question to Haiku, and ask it to select the one "most likely to contain the answer." The selected document is promoted to rank 1; the rest maintain their original order.
The prompt design is intentionally kept simple (longmemeval_bench.py:2807-2814):
Question: {question}
Below are {N} conversation sessions from someone's memory.
Which single session is most likely to contain the answer?
Reply with ONLY a number between 1 and {N}. Nothing else.
Session 1: {text[:500]}
...
Most relevant session number:
Why select only one instead of performing full reranking? The explanation in HYBRID_MODE.md is direct: requiring full reranking increases prompt complexity and error rates. Selecting the single best is "decisive and reliable." The remaining rankings maintain the hybrid scoring order -- which is already quite good on its own.
The fault-tolerant design is also noteworthy. If the API call fails (timeout, rate limiting, no key), the function catches the exception and returns the original ranking unchanged (longmemeval_bench.py:2851-2858). The system does not crash because the reranking step failed. This is an optional enhancement, not a required dependency.
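The "promote one, keep the rest, fall back on failure" shape can be sketched independently of the Anthropic SDK. Here `ask_llm` is a hypothetical stand-in for the real Haiku call, injected as a function so the fallback path can be exercised; this is a sketch of the described behavior, not longmemeval_bench.py's code:

```python
# Sketch of single-pick reranking with a no-op fallback. `ask_llm` stands in
# for the real model call; any failure returns the original ranking unchanged.

def llm_rerank_sketch(question, ranked_docs, ask_llm, top_k=10):
    candidates = ranked_docs[:top_k]
    prompt = f"Question: {question}\n" + "".join(
        f"Session {i}: {doc[:500]}\n" for i, doc in enumerate(candidates, 1)
    ) + "Most relevant session number:"
    try:
        choice = int(ask_llm(prompt))  # model replies with a single number
        if not 1 <= choice <= len(candidates):
            return ranked_docs         # nonsense reply: keep original order
    except Exception:
        return ranked_docs             # timeout / rate limit / no key: no-op
    picked = candidates[choice - 1]
    return [picked] + [d for d in ranked_docs if d != picked]

docs = ["about databases", "i prefer postgres for new projects", "redis notes"]
reranked = llm_rerank_sketch("what database do i prefer?", docs,
                             ask_llm=lambda p: "2")   # pretend model picks #2
failed = llm_rerank_sketch("same question", docs,
                           ask_llm=lambda p: 1 / 0)   # simulated API failure
print(reranked[0], failed == docs)
```

The design point is visible in the failure branch: the LLM is strictly additive, so the worst case is the hybrid-scoring order you already had.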
Step 4: Hybrid Scoring v3 + Haiku Reranking (98.8% -> 99.4%)
Problem addressed: Type 3 -- indirect expression of preferences.
This step introduced preference extraction -- using 16 regex patterns to detect user preference expressions at indexing time:
PREF_PATTERNS = [
    r"i've been having (?:trouble|issues?|problems?) with X",
    r"i prefer X",
    r"i usually X",
    r"i want to X",
    r"i'm thinking (?:about|of) X",
    # ...
]
When a preference expression is detected in a session, the system generates a synthetic document -- for example, "User has mentioned: battery life issues on phone; looking at phone upgrade options" -- and adds it to ChromaDB with the same corpus_id as the original session. This synthetic document directly bridges the semantic gap between query vocabulary and session content.
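The extraction-plus-synthesis step can be sketched end to end. The patterns below are illustrative (with real capture groups in place of the `X` placeholders shown above), and the session text is made up:

```python
# Sketch of regex-based preference extraction producing a synthetic document.
# Patterns and session text are illustrative, not MemPalace's exact set.
import re

PREF_PATTERNS = [
    r"i(?:'ve| have) been having (?:trouble|issues?|problems?) with (.+?)[\.,]",
    r"i prefer (.+?)[\.,]",
    r"i usually (.+?)[\.,]",
]

def synthesize(session_text):
    """Collect preference phrases and emit one synthetic summary document."""
    found = []
    for pattern in PREF_PATTERNS:
        found += re.findall(pattern, session_text.lower())
    if not found:
        return None
    return "User has mentioned: " + "; ".join(found)

session = ("i've been having trouble with battery life on my phone. "
           "i usually charge it twice a day.")
print(synthesize(session))
```

The synthetic document carries the query-side vocabulary ("battery life", "phone") in a compact form, so a later question like "what phone issue did I mention?" lands on it directly.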
Simultaneously, v3 expanded the Haiku reranking candidate pool from top-10 to top-20. In some assistant-reference-type failure cases, the correct session ranked 11th or 12th -- just outside Haiku's visible window. Expanding to 20 captured these edge cases, with negligible additional prompt cost.
99.4% is a milestone worth marking: only 3 questions out of 500 remained unhit. More importantly, BENCHMARKS.md records an independent verification -- Palace mode (an entirely different retrieval architecture based on hall classification and two-round navigation) also reached exactly 99.4%. Two independent architectures converging on the same score strongly suggests 99.4% is close to the architectural ceiling of this problem.
Step 5: Hybrid Scoring v4 + Haiku Reranking (99.4% -> 100%)
The last 3 failed questions were analyzed and fixed individually. BENCHMARKS.md documents each one:
Question 1: Quoted phrase. A question contained a precise phrase enclosed in single quotes. Fix: detect phrases inside quotes and give sessions containing that phrase a 60% distance reduction.
Question 2: Insufficient name weight. A temporal reasoning question about a specific person. The embedding model gave insufficient weight to the proper noun. Fix: extract capitalized proper nouns from the query and give sessions mentioning that name a 40% distance reduction.
Question 3: Memory/nostalgia pattern. A preference question involving high school experiences. Fix: add "I still remember X", "I used to X", "when I was in high school X" and similar patterns to the preference extraction patterns.
Result: 500/500. All 6 question types at 100%.
Cross-validation: Haiku vs. Sonnet. BENCHMARKS.md reports a detail: both Haiku and Sonnet, used as the reranker, achieve 100% R@5, with NDCG@10 of 0.976 and 0.975 respectively -- statistically indistinguishable. Haiku is approximately 3x cheaper and is therefore the recommended default.
Vector Distance vs. Semantic Understanding
The journey of these five improvement steps reveals a fundamental distinction: vector distance is not semantic understanding.
Vector distance measures the geometric distance between two texts in embedding space. When you ask "What database do I prefer?" and the system has three sessions about databases, the embedding model tells you "all three are related to databases, distances are about the same." But answering this question requires not "finding documents about databases" but "finding the document where the user expressed a database preference." This is a semantic reasoning task, not a distance calculation task.
This is why Haiku reranking is so effective. The embedding model can only say "these documents are related to the query." What Haiku can do is read the query and candidate documents, then reason about which document actually answers the question. The former is geometric computation; the latter is reading comprehension.
But it is worth noting: 96.6% of questions do not need this reasoning. For the vast majority of questions, vector distance is a good enough approximation of semantic understanding. Only 3.4% of edge cases require genuine reading comprehension ability to distinguish "semantically related" from "actually answers."
This proportion matters. It means:
- Vector retrieval itself is an extremely strong baseline that should not be underestimated.
- LLM reranking is not a replacement for core functionality but a patch for edge cases.
- The system is already usable and competitive without any LLM involvement.
Cost: The $0.70/Year Arithmetic
Let us calculate the actual cost of Haiku reranking.
Each reranking call sends: 1 question + 10 candidate documents of 500 characters each = approximately 5000 characters = ~1250 input tokens. Haiku's reply is a single number -- approximately 2 tokens. At Haiku's pricing, each call costs approximately $0.001.
If a user conducts 5 deep-search conversations per day, each triggering one Haiku rerank:
5 calls/day * 365 days * $0.001/call = $1.83/year
But in reality, not every search requires reranking. Most queries get correct results in the pure vector retrieval phase (96.6% of cases). If reranking is only triggered for low-confidence results, actual call frequency is even lower.
The $0.70/year cited in the README refers specifically to the target wake-up cost -- L0 + L1 loaded once daily at roughly 170 tokens each time, once the AAAK wake-up path is fully wired through. The current default implementation costs more, but stays in the same single-digit-dollars-per-year band. Wake-up and reranking are two separate cost items, but they share an order of magnitude: dollars per year, not hundreds of dollars per year.
By contrast, the gap between MemPalace's annual cost (pure local $0, with Haiku reranking ~$1-2) and competitors (Mem0, Zep, etc. at $228-$2,988 annually) is not a matter of percentage points -- it is three orders of magnitude. For complete competitor cost comparison data, see Chapter 23.
Do Not Treat 100% as "Forever Perfect"
This point must be stated clearly, because MemPalace's team themselves have repeatedly emphasized this caveat.
100% R@5 was measured on LongMemEval's 500 questions. These 500 questions cover six types, were designed by an academic team, and represent the most standardized evaluation benchmark for AI memory systems currently available. This score is reproducible, verified, and comes with complete reproduction scripts.
But it is still a specific metric on a specific test set. Here are several boundary conditions to note:
Test set size. 500 questions are sufficient for statistically meaningful comparison (confidence intervals are narrow enough), but insufficient to represent all possible memory retrieval scenarios. Real-world query diversity far exceeds 500 questions.
Question type distribution. LongMemEval's six question types are a classification defined by an academic team. Real user queries may include types beyond these six -- such as cross-modal references ("that architecture diagram you drew for me last time") or metacognitive questions ("how many times have I changed my mind on this issue").
Data characteristics. The benchmark uses conversation data prepared by the research team. Different users' conversation styles, topic distributions, and expression habits may differ significantly.
v4's targeted fixes. The three fixes from v3 to v4 (quoted phrase extraction, name boosting, nostalgia pattern detection) were designed for specific failed questions. These fixes work perfectly on the test set, but are not guaranteed to apply to entirely new failure patterns. This is an inherent limitation of any data-driven optimization.
BENCHMARKS.md expresses this honestly:
The 96.6% is the product story: free, private, one dependency, no API key, runs entirely offline. The 100% is the competitive story: a perfect score on the standard benchmark for AI memory. Both are real. Both are reproducible. Neither is the whole picture alone.
96.6% and 100% are two facets of the same system. The former is an unconditional guarantee -- independent of any external service, works in any environment; the latter is competitive performance on a standard benchmark -- but requires an API key and network connection. Users need to choose which mode to use based on their specific scenario.
The Convergence of Two Independent Paths
Before concluding this chapter, it is worth mentioning one more piece of powerful evidence validating MemPalace's retrieval ceiling.
Alongside the hybrid scoring path (hybrid v1 -> v2 -> v3 -> v4), the team independently developed Palace mode -- an entirely different retrieval architecture based on hall classification and two-round navigation. Palace mode classifies each session into one of five halls (preferences, facts, events, assistant suggestions, general), then at query time first does a compact search within the most likely hall (reducing noise), followed by a hall-weighted search across all data (preventing classification errors from causing misses).
These two paths converge precisely at 99.4% R@5. BENCHMARKS.md calls this "independent architecture convergence" -- different designs, different code paths, same score ceiling. When two independent methods hit the same ceiling, this speaks more powerfully than any single experiment about the ceiling's authenticity.
The final v4 breakthrough from 99.4% to 100% relied on three extremely targeted fixes -- essentially "manually untangling" the last three edge cases. These fixes work, but their targeted nature itself reveals a fact: on this problem, "general-purpose improvements" have reached their limits, and what remains can only be tackled by "picking them off one by one."
This is not pejorative. Quite the opposite -- it demonstrates that MemPalace's foundational architecture -- raw text + structured storage + embedding retrieval -- is strong enough that only a tiny number of highly specific cases require extra handling. 96.6% is the power of architecture. The journey from 3.4% to 0% is the power of precision engineering. Both are indispensable, but their focus differs: the former is transferable; the latter requires adaptation.
If you are designing your own memory system, the design principles behind 96.6% (store raw text, organize structurally, retrieve via embeddings) are directly borrowable. The targeted optimizations from 96.6% to 100% need to be customized based on your own failure cases. This is this chapter's most essential takeaway: do not start from optimization. Start from the baseline, then let the failure cases tell you what to optimize.
Chapter 16: Format Normalization
Positioning: The first gate through which data enters the memory palace. Five chat formats, five different data structures, but the palace accepts only one. This chapter covers how normalize.py accomplishes this translation work in under 250 lines of code -- without any ML, without any API calls, purely through pattern matching and structural transformation.
The Problem: Every Platform Invented Its Own Format
If you converse with AI, your conversation history may be scattered across five different places.
Claude Code stores sessions as JSONL -- each line is a JSON object, with the type field distinguishing human from assistant. Claude.ai's web export is a standard JSON array, where each element has role and content. ChatGPT uses a tree structure -- a mapping field contains a node tree, each node has parent and children, and messages are buried in message.content.parts. Slack exports are message lists where type is fixed as "message" and user identity is hidden in the user field. And there is the most basic format: plain text, where human utterances are marked with > and AI replies follow directly.
Five formats, five data structures, five different understandings of "what constitutes one turn of conversation."
If you want to do any downstream processing on these conversations -- chunking, retrieval, entity detection -- you have two choices.
Choice A: Process each format separately. Write five chunking logics, five entity detection logics, five retrieval logics. Every time a new format is added, all downstream modules must be modified. This is the N x M problem -- N input formats times M processing steps.
Choice B: Unify the format first, then process. Write five format converters and one set of downstream logic. Every time a new format is added, only one more converter is needed. This is the N + M problem.
MemPalace chose B. normalize.py is that N-to-1 translation layer.
Unified Output: The Transcript Format
Regardless of input format, normalize()'s output is always the same text format:
> What the user said
AI's reply
> User's next question
AI's next reply
The rules are simple:
- User utterances begin with > (borrowing Markdown blockquote syntax)
- AI replies follow immediately after the user's message with no prefix
- Each question-answer pair is separated by a blank line
This is MemPalace's internal "lingua franca." Downstream chunkers (Chapter 18), entity detectors (Chapter 17), retrieval engines -- they only need to recognize this one format.
Why this format instead of JSON? Because downstream ultimately needs plain text. Vector embedding needs text, semantic search needs text, and what is displayed to users is also text. Using JSON as an intermediate format means every downstream consumer must parse JSON and then extract text -- an unnecessary level of indirection. Using text directly eliminates this step.
The > marker design is also deliberate. It makes "distinguishing who is speaking" an O(1) operation -- check whether the line starts with >: if yes, it is the user; if no, it is the AI. No state machine needed, no JSON parsing needed, a single startswith(">") suffices.
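That O(1) claim is easy to demonstrate with a toy parser (this is an illustration of the convention, not normalize.py itself):

```python
# Toy transcript parser: `>` marks the user, unprefixed lines are the AI.
# Speaker attribution is a single startswith check per line.

def parse_transcript(text):
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate question-answer pairs
        speaker = "user" if line.startswith(">") else "assistant"
        turns.append((speaker, line.lstrip("> ").strip()))
    return turns

transcript = """> What the user said
AI's reply

> User's next question
AI's next reply"""
print(parse_transcript(transcript))
```

No state machine, no JSON parsing: the format itself carries the speaker label, which is exactly what makes it cheap for every downstream consumer.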
Detection Branches: Identification Logic for Five Formats
The normalize() function (normalize.py:22) is the entire module's entry point. Its detection logic has three tiers:
flowchart TD
A[Read file content] --> B{"Contains >= 3 lines starting with > ?"}
B -->|Yes| C[Already in transcript format -- return as-is]
B -->|No| D{"Extension is .json/.jsonl<br/>or content starts with { or [ ?"}
D -->|Yes| E[Try JSON normalization]
D -->|No| F[Plain text -- return as-is]
E --> G[Try Claude Code JSONL]
G -->|Success| H[Return transcript]
G -->|Failure| I[JSON.parse entire content]
I -->|Failure| F
I -->|Success| J[Try Claude.ai / ChatGPT / Slack in sequence]
J -->|One succeeds| H
J -->|All fail| F
The first tier of judgment is at normalize.py:37-39:
lines = content.split("\n")
if sum(1 for line in lines if line.strip().startswith(">")) >= 3:
    return content
If the file already has 3 or more lines starting with >, it is considered already in transcript format and returned directly. The threshold of 3 instead of 1 or 2 is to avoid false triggering from occasional blockquotes in Markdown files.
The second tier is at normalize.py:43:
ext = Path(filepath).suffix.lower()
if ext in (".json", ".jsonl") or content.strip()[:1] in ("{", "["):
    normalized = _try_normalize_json(content)
This uses a dual condition -- checking both the extension and the first character of the content. The extension covers normally named files; content sniffing covers files with incorrect extensions (e.g., .txt that actually stores JSON).
The third tier is a fallback. If it is neither transcript nor JSON, it is treated as plain text and returned as-is. Plain text is handled during the chunking stage using a paragraph-based chunking strategy (see Chapter 18).
Parsing the Five Specific Formats
Format 1: Claude Code JSONL
Claude Code's session export is JSONL format -- each line is an independent JSON object. The parsing function is _try_claude_code_jsonl() (normalize.py:71).
Input example:
{"type": "human", "message": {"content": "Explain Python's GIL"}}
{"type": "assistant", "message": {"content": "The GIL is the Global Interpreter Lock..."}}
{"type": "human", "message": {"content": "Is multithreading still useful then?"}}
{"type": "assistant", "message": {"content": "Yes, it depends on your scenario..."}}
The key identification signal is the type field -- "human" or "assistant". The parsing logic reads line by line, skipping lines that fail to parse, skipping non-dictionary lines, extracting only type and message.content.
The final validation condition is at normalize.py:92:
if len(messages) >= 2:
    return _messages_to_transcript(messages)
return None
At least 2 messages are required for the parse to be considered successful. This is the minimum threshold shared across all format parsers -- one question and one answer constitute a meaningful conversation.
Note this function's position within _try_normalize_json() (normalize.py:54) -- it is placed first among all JSON parsers, and executes before json.loads(). The reason is that JSONL is not valid JSON; calling json.loads() on the entire content would fail. So line-by-line parsing must be attempted first.
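The tolerant line-by-line strategy described above can be sketched in a few lines. The helper name and tuple output here mirror the prose, not normalize.py verbatim:

```python
# Sketch of tolerant JSONL parsing: skip bad lines, skip non-dictionaries,
# keep only type + message.content, require at least two messages.
import json

def parse_claude_code_jsonl(content):
    messages = []
    for line in content.splitlines():
        try:
            obj = json.loads(line)
        except (json.JSONDecodeError, ValueError):
            continue  # skip lines that fail to parse
        if not isinstance(obj, dict):
            continue  # skip non-dictionary lines
        role = obj.get("type")
        text = (obj.get("message") or {}).get("content")
        if role in ("human", "assistant") and text:
            messages.append((role, text))
    return messages if len(messages) >= 2 else None

raw = "\n".join([
    '{"type": "human", "message": {"content": "Explain the GIL"}}',
    "not json at all",  # a corrupt line: silently skipped
    '{"type": "assistant", "message": {"content": "The GIL is..."}}',
])
print(parse_claude_code_jsonl(raw))
```

One corrupt line does not sink the session: the parser extracts what it can, and only gives up (returns None) if fewer than two messages survive.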
Format 2: Claude.ai JSON
Claude.ai's web export is standard JSON. The parsing function is _try_claude_ai_json() (normalize.py:97).
Input example:
[
{"role": "user", "content": "What is a memory palace?"},
{"role": "assistant", "content": "A memory palace is an ancient mnemonic technique..."}
]
Or wrapped in an outer object:
{
"messages": [
{"role": "human", "content": "..."},
{"role": "ai", "content": "..."}
]
}
This parser's flexibility is evident in two areas. First, the outer structure -- it accepts both a direct array and an object containing messages or chat_messages keys (normalize.py:99-100). Second, role names -- both "user" and "human" are recognized as the user, and both "assistant" and "ai" are recognized as the AI (normalize.py:109-112). This lenient parsing strategy is not sloppiness but pragmatism -- Claude's API and web interface have used different field names across versions, and rather than guessing "which does the current version use," it supports all of them.
Format 3: ChatGPT conversations.json
ChatGPT's export format is the most complex of all formats. The parsing function is _try_chatgpt_json() (normalize.py:118).
ChatGPT does not store conversations in a linear array but in a tree. Why? Because ChatGPT supports "edit a previous message and regenerate" -- users can go back to any point in the conversation, modify their prompt, and generate a new branch. This tree represents those branches.
Input example (simplified):
{
  "mapping": {
    "root-id": {
      "parent": null,
      "message": null,
      "children": ["msg-1"]
    },
    "msg-1": {
      "parent": "root-id",
      "message": {
        "author": {"role": "user"},
        "content": {"parts": ["What is a vector database?"]}
      },
      "children": ["msg-2"]
    },
    "msg-2": {
      "parent": "msg-1",
      "message": {
        "author": {"role": "assistant"},
        "content": {"parts": ["A vector database is a specialized..."]}
      },
      "children": []
    }
  }
}
The identification signal is the top-level mapping key (normalize.py:120). The parsing strategy is to find the root node (a node with null parent and no message), then follow each node's first children all the way down, forming a linear path.
There is a design decision here: when a node has multiple children (i.e., the user edited a message creating a branch), only the first branch is taken (normalize.py:153). This means branch history is lost. This is a deliberate tradeoff -- for memory storage, preserving "the final direction of the conversation" is more valuable than preserving "all possible branches." If all branches were preserved, subsequent retrieval would produce many near-duplicate results.
The traversal process also uses a visited set (normalize.py:140) to prevent circular references -- though normal ChatGPT exports should not have cycles, defensive programming is always worthwhile.
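A sketch of this first-child traversal, assuming the mapping shape shown above (the function name is illustrative, not the real normalize.py API):

```python
def linearize_mapping(mapping: dict) -> list:
    """Walk a ChatGPT-style mapping tree, following only the first child
    at each branch point, so branch history is deliberately dropped."""
    # Root: the node with no parent (in real exports it also has no message).
    root_id = next(nid for nid, node in mapping.items() if node.get("parent") is None)
    path, visited = [], set()
    node_id = root_id
    while node_id and node_id not in visited:
        visited.add(node_id)  # guards against cycles in malformed exports
        node = mapping[node_id]
        msg = node.get("message")
        if msg:
            role = msg["author"]["role"]
            text = " ".join(p for p in msg["content"]["parts"] if isinstance(p, str))
            path.append((role, text))
        children = node.get("children") or []
        node_id = children[0] if children else None
    return path

mapping = {
    "root": {"parent": None, "message": None, "children": ["a"]},
    "a": {"parent": "root",
          "message": {"author": {"role": "user"},
                      "content": {"parts": ["What is a vector database?"]}},
          "children": ["b", "b2"]},  # two children: the user edited, creating a branch
    "b": {"parent": "a",
          "message": {"author": {"role": "assistant"},
                      "content": {"parts": ["A specialized store..."]}},
          "children": []},
    "b2": {"parent": "a", "message": None, "children": []},  # abandoned branch
}
```

Only the `"b"` branch survives linearization; `"b2"` is never visited.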
Format 4: Slack JSON
Slack's channel export is a message array. The parsing function is _try_slack_json() (normalize.py:159).
Input example:
[
  {"type": "message", "user": "U123", "text": "Should we run tests before deploying?"},
  {"type": "message", "user": "U456", "text": "Must run. CI is configured for it."},
  {"type": "message", "user": "U123", "text": "OK, I'll run a local round first."}
]
Slack has a fundamental difference from other formats: it is not a "user vs. AI" conversation but a "person vs. person" conversation. There is no natural "questioner" and "answerer" role.
The parser's handling strategy is clever (normalize.py:177-186): the first user to appear is labeled as "user", the second as "assistant". If there is a third, fourth participant, their roles depend on the current last_role value -- alternating between user and assistant.
if not seen_users:
    seen_users[user_id] = "user"
elif last_role == "user":
    seen_users[user_id] = "assistant"
else:
    seen_users[user_id] = "user"
This does not claim "someone is an AI" but uses the user/assistant alternating structure to ensure downstream question-answer pair chunking works correctly. In the transcript format, > marks the "questioner" and the unmarked text is the "responder." For Slack DMs, who is asking and who is answering alternates naturally -- you ask me a question, I answer, then I ask you a question, you answer. Alternating role assignment happens to match this natural conversational rhythm.
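The alternating assignment can be exercised with a small self-contained sketch (the function name is illustrative; it assumes, as the excerpt implies, that the role map is only updated when a participant first appears):

```python
def assign_roles(user_ids):
    """Assign alternating user/assistant roles as new Slack participants
    appear, mirroring the alternation logic described above."""
    seen, last_role, roles = {}, None, []
    for uid in user_ids:
        if uid not in seen:
            if not seen:
                seen[uid] = "user"        # first participant to speak
            elif last_role == "user":
                seen[uid] = "assistant"   # second participant
            else:
                seen[uid] = "user"        # later participants alternate
        roles.append(seen[uid])
        last_role = seen[uid]
    return roles
```

For a two-person DM this yields exactly the question/answer rhythm the transcript format expects.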
Format 5: Plain Text
Plain text has no dedicated parser. If a file is neither in transcript format nor JSON format, normalize() returns the original content directly (normalize.py:48).
These files are handled during the chunking stage by convo_miner.py's _chunk_by_paragraph() (see Chapter 18), chunking by paragraphs or line groups.
Content Extraction: Unified Handling of Polymorphic Content
Among the five formats, "message content" is represented differently. Claude Code's content may be a string or an array containing {"type": "text", "text": "..."}; ChatGPT's content is buried in content.parts as a string array; Claude.ai's may also be a string or array.
The _extract_content() function (normalize.py:192) is the unified entry point for handling this polymorphism:
def _extract_content(content) -> str:
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        parts = []
        for item in content:
            if isinstance(item, str):
                parts.append(item)
            elif isinstance(item, dict) and item.get("type") == "text":
                parts.append(item.get("text", ""))
        return " ".join(parts).strip()
    if isinstance(content, dict):
        return content.get("text", "").strip()
    return ""
Three types, three branches, with an empty string fallback. This function is shared by all format parsers, avoiding duplicated content polymorphism handling in each parser.
Note the list branch's two treatments of array elements: if the element is a string it is taken directly; if the element is a dict with type equal to "text", the text field is taken. This covers Claude API's content block format ([{"type": "text", "text": "..."}, {"type": "image", ...}]), while automatically skipping non-text content blocks such as images.
Transcript Generation: From Message Lists to Transcript
All format parsers ultimately call _messages_to_transcript() (normalize.py:209), converting [(role, text), ...] lists into transcript text.
The core logic of this function is pairing -- after finding a user message, check if the next one is assistant; if so, pair them together; if not (e.g., two consecutive user messages), output the user message alone.
while i < len(messages):
    role, text = messages[i]
    if role == "user":
        if _fix is not None:
            text = _fix(text)
        lines.append(f"> {text}")
        if i + 1 < len(messages) and messages[i + 1][0] == "assistant":
            lines.append(messages[i + 1][1])
            i += 2
        else:
            i += 1
    else:
        lines.append(text)
        i += 1
    lines.append("")
There is another detail here: user messages go through spellcheck (spellcheck_user_text, brought in via optional import). Why only check user text? Because AI output almost never has spelling errors, while users typing in chat boxes frequently make typos. Correcting these spelling errors at the normalization stage can improve the accuracy of subsequent vector retrieval -- "waht is GIL" and "what is GIL" may have a non-trivial distance in embedding space.
The Deeper Logic of Architectural Choices
Returning to the N + M vs. N x M question from the beginning. MemPalace's normalization layer is not just about "writing less code." It brings three deeper benefits.
First, it decouples input from processing. When a new AI platform's export format needs support (say, Gemini's), you only need to add a _try_gemini_json() function in normalize.py. The chunker does not need to change, the retriever does not need to change, the entity detector does not need to change.
Second, it makes testing manageable. Downstream modules only need test cases written in transcript format. No need to prepare test data in five formats or maintain five testing matrices. The correctness of format conversion is independently guaranteed by normalize.py's unit tests.
Third, it largely preserves data semantics. The normalization process mainly performs format conversion rather than content rewriting (aside from spellcheck). The original question-answer structure and conversation order are generally preserved. But "speaker identity is fully preserved" would be too strong: in the Slack path, multi-party conversations can collapse into a simpler user / assistant alternation for chunking purposes. This still preserves sequence and exchange structure reasonably well, but it does not always preserve the original speaker identity in full fidelity.
The entire normalize.py file is only 253 lines, has no external dependencies (using only json, os, and pathlib from the standard library), and calls no network APIs. It runs locally with near-instantaneous speed. This minimalism is not accidental -- it is a direct manifestation of the engineering principle that "the first layer of a data pipeline should be as simple and reliable as possible." If the normalization layer itself is complex enough to require debugging, it defeats its own purpose.
Summary
Format normalization solves a problem that seems inconspicuous but is critically important: building a bridge between the chaotic real-world data and a clean internal representation. Five formats come in, one format goes out. Downstream modules never need to care about "was this data originally exported from ChatGPT or Slack."
Key design points:
- Detection priority: transcript passthrough > JSONL line-by-line attempt > whole-content JSON parse > plain text fallback
- Lenient parsing: multiple role names accepted (`user`/`human`, `assistant`/`ai`), flexible outer structure (array or object)
- Unified output: `>`-marked user turn + assistant response + blank line, minimal yet informationally complete
- Near-lossless conversion: changes format only, not content (aside from spellcheck and the Slack role collapsing noted above)
- Zero external dependencies: standard library suffices
The next chapter will examine how normalized text is scanned to discover the people and projects mentioned within it -- without machine learning, without NER models, using only regular expressions and a scoring algorithm.
Chapter 17: ML-Free Entity Discovery
Positioning: Finding names of people and projects in conversation text -- without spaCy, without transformers, without any trained model. Using only regular expressions, frequency statistics, and a signal scoring system. This chapter explains why MemPalace chose rules over NER, and how this rule-based system works in detail.
Why Not NER
Named Entity Recognition (NER) is a classic task in natural language processing. spaCy, Stanford NER, and various transformer models on Hugging Face can all do it. Give them a piece of text and they tell you which words are person names, place names, or organization names.
For MemPalace, using NER would mean the following.
First, dependencies. spaCy itself is approximately 30MB; add a language model (en_core_web_sm is ~12MB, en_core_web_lg is ~560MB), and the installation size balloons from MemPalace's few hundred KB to tens or hundreds of MB. And that does not count PyTorch -- if using transformer models, PyTorch starts at 2GB. MemPalace's entire dependency list is just chromadb and pyyaml; introducing NER would increase dependency size by two orders of magnitude.
Second, context. NER models are trained on news, Wikipedia, and academic papers. They are excellent at recognizing entities like "Barack Obama," "Google," and "New York" that appear frequently in training data. But MemPalace processes private conversations -- your daughter is named Riley, your project is called MemPalace, your colleague is named Arjun. These names are not in any training set. NER models' recognition rate for them depends on contextual cues, and the contextual cues in chat conversations are often less rich than in news articles.
Third, precision requirements. MemPalace does not need to find all entities in arbitrary text. What it needs is to find people and projects that appear repeatedly in the user's own conversation history. This is a much more constrained problem -- the number of candidate entities is limited (a person's daily conversations typically mention no more than a few dozen names repeatedly), and there is a strong frequency signal (truly important entities will certainly appear repeatedly).
This is not to say NER is bad. In its appropriate scenarios -- processing large volumes of text from unknown sources, needing to identify arbitrary entity types, handling multilingual mixed content -- NER is an irreplaceable tool. But MemPalace's scenario happens to not be that scenario. It processes the user's own, limited, clearly patterned conversation data. Under these constraints, the rule-based approach is sufficient and brings benefits NER cannot: zero additional dependencies, millisecond-level execution, fully local, fully transparent (you can know exactly why a word was identified as a person name).
In fairness, if MemPalace needs to process Chinese conversations in the future (Chinese lacks the uppercase letter signal that is a natural proper noun indicator) or needs to identify organization names, place names, event names, and other entity types, the limitations of the rule-based approach will become apparent. At that point, introducing NER may be the right choice. But in the current scenario of English conversations plus person/project binary classification, rules suffice.
Two-Pass Scanning Architecture
entity_detector.py uses a two-pass scanning architecture (entity_detector.py:8-9):
First pass: Candidate extraction. Find all capitalized words in the text, count frequencies, and filter out stopwords and low-frequency words.
Second pass: Signal scoring and classification. For each candidate word, use a set of regex patterns to detect whether it is "person-like" or "project-like," assign scores, and make the final classification.
This two-pass design has an important benefit: the first pass computation is O(n) -- just one full-text scan and word frequency count. The second pass computation is O(k * n) -- k is the number of candidate words, with one full-text regex match per candidate. Because candidates typically number only a few dozen (the first pass's frequency filter dramatically narrows the scope), the second pass's actual computation is manageable.
First Pass: Candidate Extraction
The extract_candidates() function (entity_detector.py:443) is responsible for extracting candidate entities from text.
The core logic consists of two regexes:
# Single-word proper nouns: starts with uppercase, 1-19 lowercase letters
raw = re.findall(r"\b([A-Z][a-z]{1,19})\b", text)
# Multi-word proper nouns: consecutive capitalized words (e.g., "Memory Palace")
multi = re.findall(r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b", text)
The first pattern matches individual capitalized words like "Riley," "Claude," or "Python." Length is restricted to 2-20 characters -- words that are too short (like "I") are unlikely to be names, and overly long ones are abnormal as well.
The second pattern matches consecutive capitalized word groups like "Memory Palace" or "Claude Code." This captures multi-word project names or full names.
Extracted candidates go through two layers of filtering:
1. Stopword filtering. The `STOPWORDS` set contains approximately 250 common English words (entity_detector.py:92-396), covering pronouns, prepositions, conjunctions, common verbs, technical terms (`return`, `import`, `class`), UI action words (`click`, `scroll`), and words that start with a capital letter but are almost never entities (Monday, World, Well). This list is so long because English has many words that can appear capitalized at the start of a sentence -- "Step one is...," "Click the button...," "Well, actually..." -- none of which are entities.
2. Frequency filtering. Candidates must appear at least 3 times (entity_detector.py:463). A truly important person or project will appear repeatedly in conversation history. A name mentioned only once is probably unimportant and more likely to be a false positive.
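The first pass can be sketched end to end with the two regexes plus both filters; the stopword set here is a tiny illustrative subset of the real ~250-word list, and the function is a simplification, not the actual entity_detector.py code:

```python
import re
from collections import Counter

STOPWORDS = {"The", "This", "Well", "Monday", "Click", "What"}  # illustrative subset
MIN_FREQ = 3

def extract_candidates(text: str) -> list:
    """First pass: collect capitalized words, count frequencies, then
    drop stopwords and anything below the frequency threshold."""
    single = re.findall(r"\b([A-Z][a-z]{1,19})\b", text)
    counts = Counter(w for w in single if w not in STOPWORDS)
    return [w for w, n in counts.items() if n >= MIN_FREQ]

text = ("Riley said hi. The demo worked. Riley laughed. "
        "Well, Riley wants pizza. Monday we ship. Arjun waved.")
```

"Riley" (3 occurrences) survives both filters; "Arjun" (1 occurrence) is dropped by the frequency filter even though it looks name-like.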
Second Pass: Signal Scoring
For each candidate that passes the first-pass filter, the score_entity() function (entity_detector.py:486) uses a set of patterns to detect whether it resembles a person or a project.
Person Detection Patterns
MemPalace uses four categories of signals to determine whether a word is a person name:
Signal 1: Verb patterns. People perform certain specific actions. PERSON_VERB_PATTERNS (entity_detector.py:27-48) defines 20 patterns:
| Pattern Type | Example | Weight |
|---|---|---|
| Speech verbs | {name} said, {name} asked, {name} told | x2 |
| Emotion verbs | {name} laughed, {name} smiled, {name} cried | x2 |
| Cognitive verbs | {name} thinks, {name} knows, {name} decided | x2 |
| Volitional verbs | {name} wants, {name} loves, {name} hates | x2 |
| Address patterns | hey {name}, thanks {name}, hi {name} | x2 |
"Riley said" is essentially saying "Riley is a person." Projects do not say things; systems do not laugh. These verbs are extremely strong person signals.
Signal 2: Pronoun proximity. If a name has personal pronouns (she, he, they, her, him, etc.) nearby (within 3 lines before or after), this is a person signal. The detection logic is at entity_detector.py:515-525:
name_line_indices = [i for i, line in enumerate(lines) if name_lower in line.lower()]
pronoun_hits = 0
for idx in name_line_indices:
    window_text = " ".join(lines[max(0, idx - 2) : idx + 3]).lower()
    for pronoun_pattern in PRONOUN_PATTERNS:
        if re.search(pronoun_pattern, window_text):
            pronoun_hits += 1
            break
The window size is 5 lines (2 before + current + 2 after). Only one pronoun hit is counted per window (break), preventing multiple pronouns in one window from inflating the score. Each hit has weight x2.
Signal 3: Dialogue markers. DIALOGUE_PATTERNS (entity_detector.py:64-69) recognizes speaker annotation formats in conversation text:
DIALOGUE_PATTERNS = [
    r"^>\s*{name}[:\s]",   # > Speaker: ...
    r"^{name}:\s",         # Speaker: ...
    r"^\[{name}\]",        # [Speaker]
    r'"{name}\s+said',     # "Riley said..."
]
Dialogue markers carry the highest weight: x3 per hit (entity_detector.py:503). Because if a name appears in a dialogue marker position -- "Riley:" or "[Riley]" -- it is almost certainly a person name.
Signal 4: Direct address. If the text contains "hey Riley," "thanks Riley," or "hi Riley," the weight is x4 (entity_detector.py:528-531). This is the highest weight among all person signals, because directly addressing someone by name is almost impossible to confuse with naming a project.
Project Detection Patterns
PROJECT_VERB_PATTERNS (entity_detector.py:72-89) defines another set of patterns:
| Pattern Type | Example | Weight |
|---|---|---|
| Build verbs | building {name}, built {name} | x2 |
| Release verbs | shipping {name}, launched {name}, deployed {name} | x2 |
| Architecture descriptors | the {name} architecture, the {name} pipeline | x2 |
| Version identifiers | {name} v2, {name}-core | x3 |
| Code references | {name}.py, import {name}, pip install {name} | x3 |
Version identifiers and code references carry higher weight (x3), because they are ironclad proof of project status -- people do not have .py suffixes, people are not pip install-ed.
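A sketch of how such weighted `{name}` templates might be applied; the pattern subset and weights follow the table above, but the function name and scoring loop are illustrative, not the real scorer:

```python
import re

# Illustrative subset of project-signal patterns: (template, weight).
PROJECT_PATTERNS = [
    (r"building\s+{name}\b", 2),     # build verb
    (r"\b{name}\s+v\d", 3),          # version identifier
    (r"\b{name}\.py\b", 3),          # code reference
    (r"pip\s+install\s+{name}\b", 3),
]

def project_score(name: str, text: str) -> int:
    """Sum weight * hit-count over all project patterns for one candidate."""
    score = 0
    for template, weight in PROJECT_PATTERNS:
        pattern = template.format(name=re.escape(name))
        score += weight * len(re.findall(pattern, text, re.IGNORECASE))
    return score

text = "We are building MemPalace. See mempalace.py and MemPalace v2 notes."
```

Here one build-verb hit (x2) plus one code reference and one version identifier (x3 each) yield a score of 8.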
Classification Decision
The classify_entity() function (entity_detector.py:562) makes the final classification based on scoring results. The classification logic considers not just scores but also signal diversity.
Core decision tree:
1. No signals (total score is 0): classified as `uncertain`, maximum confidence 0.4. These words made it to the candidate list purely on frequency, with no contextual cues.
2. Person ratio >= 70%, with two or more different signal types, and person score >= 5: classified as `person`. The "two or more different signal types" requirement is a critical design element (entity_detector.py:587-601). Why is this additional condition needed? Consider this scenario: the text repeatedly contains "Click said..." (describing some UI framework's log output). "Click" would score highly on verb patterns -- "Click said" matches `{name} said`. But it only scores on one signal type (speech verbs). A genuine person name would typically trigger multiple signals -- "Riley said" (speech verb) plus nearby "she" (pronoun) plus "hey Riley" (direct address). Requiring two or more different signal types filters out words that appear frequently in one specific sentence pattern but are not actually person names.
3. Person ratio >= 70%, but diversity condition not met: downgraded to `uncertain`, confidence 0.4 (entity_detector.py:605-609). The code comment explicitly states the reason: "Pronoun-only match -- downgrade to uncertain."
4. Person ratio <= 30%: classified as `project`.
5. All other cases: classified as `uncertain`, marked "mixed signals -- needs review."
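The decision tree can be sketched as a pure function. The thresholds mirror the description above, but the confidence formula here is an assumption for illustration, not the real classify_entity() code:

```python
def classify(person_score: int, project_score: int, signal_types: int):
    """Decision-tree sketch: score ratios plus a signal-diversity gate."""
    total = person_score + project_score
    if total == 0:
        return ("uncertain", 0.4)            # frequency-only candidate
    ratio = person_score / total
    if ratio >= 0.7:
        if signal_types >= 2 and person_score >= 5:
            return ("person", min(1.0, 0.5 + person_score / 20))
        return ("uncertain", 0.4)            # one-pattern hit: downgrade
    if ratio <= 0.3:
        return ("project", min(1.0, 0.5 + project_score / 20))
    return ("uncertain", 0.4)                # mixed signals -- needs review
```

The "Click said" case maps to `classify(8, 0, 1)`: a high person ratio but only one signal type, so it is downgraded rather than confirmed.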
The Complete Detection Flow
The detect_entities() function (entity_detector.py:632) chains all the above steps together:
1. Collect file contents. Each file reads only the first 5000 bytes (entity_detector.py:652), with a maximum of 10 files. This is not laziness -- entity detection does not need to read entire files. If a name has not appeared in the first 5KB, it is probably not a core entity.
2. File selection is deliberate. The `scan_for_detection()` function (entity_detector.py:813) prioritizes prose files (.txt, .md, .rst, .csv), falling back to code files only when fewer than 3 prose files are available. The reason is that code files contain too many capitalized words -- class names, function names, constants -- which would produce numerous false positives. The code comment states this clearly (entity_detector.py:398-399): "Code files have too many capitalized names (classes, functions) that aren't entities."
3. Merge text from all files and call `extract_candidates()` to extract candidates.
4. Call `score_entity()` and `classify_entity()` for each candidate.
5. Group by type, sort by confidence, and take the top N (maximum 15 people, 10 projects, 8 uncertain).
Entity Registry: Persistence and Disambiguation
Detected entities need persistent storage and must handle some tricky disambiguation problems. This is the job of entity_registry.py.
The EntityRegistry class (entity_registry.py:268) maintains a JSON file (defaulting to ~/.mempalace/entity_registry.json), storing three categories of information:
{
  "version": 1,
  "mode": "personal",
  "people": {
    "Riley": {
      "source": "onboarding",
      "contexts": ["personal"],
      "aliases": [],
      "relationship": "daughter",
      "confidence": 1.0
    }
  },
  "projects": ["MemPalace", "Acme"],
  "ambiguous_flags": ["ever", "max"],
  "wiki_cache": {}
}
Entity sources (source) have three priority levels:
- onboarding: Explicitly declared by the user during initialization. Confidence 1.0, unchallengeable.
- learned: Inferred by the system from session history. Confidence depends on the detection algorithm's output.
- wiki: Confirmed via Wikipedia API query. Confidence depends on Wikipedia's description content.
Ambiguous Word Handling
This is the most interesting part of the registry. English has many words that are both common vocabulary and given names -- Ever, Grace, Will, May, Max, Rose, Ivy, Chase, Hunter, Lane...
The COMMON_ENGLISH_WORDS set (entity_registry.py:31-89) lists approximately 50 such words. When these words appear in the registry, the system adds them to the ambiguous_flags list.
Subsequently, each time these words are queried, the _disambiguate() method (entity_registry.py:463) checks context to determine whether the usage is a person name or a common word:
Person name context patterns (entity_registry.py:92-113):
PERSON_CONTEXT_PATTERNS = [
    r"\b{name}\s+said\b",    # "Ever said..."
    r"\bwith\s+{name}\b",    # "...with Ever"
    r"\bsaw\s+{name}\b",     # "I saw Ever"
    r"\b{name}(?:'s|s')\b",  # "Ever's birthday"
    r"\bhey\s+{name}\b",     # "hey Ever"
    r"^{name}[:\s]",         # "Ever: let's go"
]
Common word context patterns (entity_registry.py:116-127):
CONCEPT_CONTEXT_PATTERNS = [
    r"\bhave\s+you\s+{name}\b",  # "have you ever"
    r"\bif\s+you\s+{name}\b",    # "if you ever"
    r"\b{name}\s+since\b",       # "ever since"
    r"\b{name}\s+again\b",       # "ever again"
    r"\bnot\s+{name}\b",         # "not ever"
]
The disambiguation logic is straightforward: count how many times each group of patterns matches. If person name patterns score higher, classify as a person name; if common word patterns score higher, classify as a concept. If tied, return None -- letting the caller fall back to default behavior (words already registered as person names are still treated as names on a tie, since the user has already declared them).
This disambiguation mechanism means the system can correctly handle conversations like:
> Have you ever tried the new API? <- "ever" = adverb
> I went to the park with Ever yesterday. <- "Ever" = person name
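A minimal sketch of this match-counting disambiguation, using a subset of the two pattern lists above (the function name is illustrative, not the real _disambiguate() signature):

```python
import re

PERSON_PATTERNS = [r"\b{name}\s+said\b", r"\bwith\s+{name}\b", r"\bhey\s+{name}\b"]
CONCEPT_PATTERNS = [r"\bhave\s+you\s+{name}\b", r"\bif\s+you\s+{name}\b",
                    r"\b{name}\s+since\b"]

def disambiguate(name: str, text: str):
    """Count context-pattern hits on each side; return None on a tie so the
    caller can fall back to the registered default."""
    text, name = text.lower(), name.lower()
    person = sum(len(re.findall(p.format(name=name), text)) for p in PERSON_PATTERNS)
    concept = sum(len(re.findall(p.format(name=name), text)) for p in CONCEPT_PATTERNS)
    if person > concept:
        return "person"
    if concept > person:
        return "concept"
    return None
```

On "Have you ever tried it?" the concept patterns win; on "with Ever" plus "Ever said" the person patterns win.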
Wikipedia Lookup
For unfamiliar capitalized words that are neither in the registry nor on the stopword list, the registry provides Wikipedia query functionality (entity_registry.py:179).
The query uses Wikipedia's REST API (free, no API key required), and determines the word's type based on the returned summary:
- If the summary contains phrases like "given name," "personal name," or "masculine name" (entity_registry.py:135-161), the word is classified as a person name
- If the summary contains phrases like "city in," "municipality," or "capital of," it is classified as a place name
- If the word is not found on Wikipedia at all (404), it is classified as a person name (entity_registry.py:249-256) -- a capitalized word not on Wikipedia is very likely someone's specific name or nickname
Query results are cached in wiki_cache to avoid repeated requests.
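A sketch of the summary-based classification plus a cached lookup against Wikipedia's public REST summary endpoint. The endpoint URL is the real Wikipedia REST API, but the function names, keyword lists, and cache shape here are illustrative assumptions, not the registry's actual code:

```python
import json
import urllib.error
import urllib.request

WIKI_URL = "https://en.wikipedia.org/api/rest_v1/page/summary/{title}"

def classify_summary(summary):
    """Map a Wikipedia summary (or None for a 404) to an entity guess,
    following the heuristics described above."""
    if summary is None:
        return "person"   # absent from Wikipedia: likely a private name/nickname
    s = summary.lower()
    if any(k in s for k in ("given name", "personal name", "masculine name")):
        return "person"
    if any(k in s for k in ("city in", "municipality", "capital of")):
        return "place"
    return "unknown"

_cache = {}  # plays the role of wiki_cache: one request per title

def wiki_lookup(title: str) -> str:
    """Fetch the REST summary once per title, then classify it."""
    if title not in _cache:
        try:
            with urllib.request.urlopen(WIKI_URL.format(title=title), timeout=5) as r:
                _cache[title] = json.load(r).get("extract")
        except urllib.error.HTTPError:
            _cache[title] = None  # 404: not found on Wikipedia
    return classify_summary(_cache[title])
```

The pure classifier can be tested offline; only `wiki_lookup()` touches the network.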
Continuous Learning from Sessions
The registry's learn_from_text() method (entity_registry.py:553) makes entity detection not limited to the initialization stage. Each time new session text is processed, the system can call this method to run candidate extraction and signal scoring on the text. If new high-confidence person candidates are found (default threshold 0.75), they are automatically added to the registry.
This creates a gradual learning loop: initial onboarding provides seed data, and subsequent sessions may supplement newly discovered entities. But the threshold is intentionally set high (0.75), because the cost of an automatic false addition is not just one wrong entry -- it affects all subsequent query results. It is better to miss some than to misjudge.
Interactive Confirmation
Detection results ultimately need human confirmation. The confirm_entities() function (entity_detector.py:717) provides a concise interactive interface:
==========================================================
MemPalace -- Entity Detection
==========================================================
Scanned your files. Here's what we found:
PEOPLE:
1. Riley [*****] dialogue marker (5x), pronoun nearby (3x)
2. Arjun [***--] 'Arjun ...' action (4x), addressed directly (2x)
PROJECTS:
1. MemPalace [****-] code file reference (3x), versioned (2x)
UNCERTAIN (need your call):
1. Claude [**---] mixed signals -- needs review
Confidence is visualized as a five-slot bar of filled and hollow markers (entity_detector.py:712), mapping the 0-1.0 confidence range onto five positions. Each entity also lists the top two triggered signals, letting users understand "why the system thinks this is a person name."
Users can accept all detection results, manually correct misclassifications, or add entities the system missed. They can also pass yes=True to skip interaction and automatically accept all non-uncertain results.
Boundaries of the Rule-Based Approach
The rule-based approach works well in MemPalace's scenario, but understanding its boundaries is equally important:
| Dimension | Rule-Based | ML/NER |
|---|---|---|
| Dependencies | Zero (standard library regex) | spaCy/transformers + model files |
| Execution speed | Milliseconds | Seconds (slower on first model load) |
| English name recognition | Relies on capitalization + context | Based on statistical models, more robust |
| Chinese name recognition | Essentially impossible (no capitalization signal) | Specialized models can handle it |
| Uncommon names | Identifiable as long as frequency is high enough | Depends on training data coverage |
| Explainability | Fully transparent (which rule triggered) | Black box or semi-transparent |
| Applicable scale | Personal conversations (tens to hundreds of files) | Unlimited |
| New entity type extension | Requires handwriting new rules | Fine-tune or swap models |
MemPalace's choice is not "rules are better than NER" but "in this specific scenario, rules have a better return on investment." This is an engineering judgment, not a technological belief.
Summary
The entity detection module uses a two-pass scanning + signal scoring architecture to achieve automatic identification of people and projects in English conversations without introducing any ML dependencies.
Key design points:
- Two-pass scanning: First use frequency filtering to narrow the candidate range, then use signal scoring for precise classification
- Four categories of person signals: Verb patterns, pronoun proximity, dialogue markers, direct address, with weights from x2 to x4
- Diversity requirement: Confirmation as a person requires two or more different signal types, preventing single-pattern false positives
- Persistent registry: Three-tier source priority (onboarding > learned > wiki), context-based disambiguation for ambiguous words
- Gradual learning: Each session may discover new entities, but the threshold for automatic addition is intentionally set high
The next chapter will cover chunking -- cutting normalized text into segments suitable for vector retrieval. Conversation text and project files require entirely different chunking strategies because their minimum semantic units differ.
Chapter 18: The Art of Chunking
Positioning: Half of a vector database's retrieval quality depends on the chunking strategy. Chunks too large fill search results with irrelevant content; chunks too small fracture semantics. This chapter covers MemPalace's two chunking strategies -- fixed windows for project files, Q&A pairs for conversations -- and why conversation text cannot use fixed windows.
Why Chunking Is a Problem
Vector retrieval works like this: convert text fragments into vectors, store them in a database; at query time, convert the question into a vector and find the nearest matches.
What if you skip chunking and store an entire file as a single vector? Two problems arise. First, embedding models have length limits -- most models can only process 512 or 8192 tokens, truncating anything beyond that. Second, even if a model could handle long text, the embedding vector of a long document becomes an "average" of all its topics -- a document that simultaneously discusses database design, deployment strategy, and team management will have its vector land in the middle of all three topics, meaning a search on any single topic will barely find it.
So you must chunk. The question is how.
MemPalace's answer to this problem is: project files and conversation files need different chunking strategies because their minimum semantic units differ.
Project File Chunking: Fixed Window + Paragraph Awareness
The chunk_text() function in miner.py (miner.py:135) handles project file chunking. Its parameters are defined at the top of the file (miner.py:56-58):
CHUNK_SIZE = 800 # chars per drawer
CHUNK_OVERLAP = 100 # overlap between chunks
MIN_CHUNK_SIZE = 50 # skip tiny chunks
800 characters is roughly 150-200 English words, equivalent to a medium-sized paragraph. The choice of 800 rather than a larger value (say 2000) is because project file content is typically compact -- a Python function, a README section, a configuration block. 800 characters is enough to contain a complete logical unit while being small enough to keep retrieval results precise.
The 100-character overlap handles sentences that happen to be split at boundaries. If an important sentence straddles two chunks, the overlap ensures the last 100 characters of the first chunk and the beginning of the next chunk are the same. This means the sentence is complete in at least one chunk.
But chunk_text() doesn't mechanically cut every 800 characters. It has paragraph-aware logic (miner.py:153-161):
if end < len(content):
    # Prefer cutting at a double newline (a paragraph boundary)
    newline_pos = content.rfind("\n\n", start, end)
    if newline_pos > start + CHUNK_SIZE // 2:
        end = newline_pos
    else:
        # Fall back to cutting at a single newline
        newline_pos = content.rfind("\n", start, end)
        if newline_pos > start + CHUNK_SIZE // 2:
            end = newline_pos
It first tries to cut at double newlines (paragraph boundaries). If a double newline is found in the range [start + 400, start + 800], it cuts there. If no double newline is found, it looks for a single newline. Only if no newline is found at all (e.g., an extremely long unbroken line of text) does it hard-cut at 800 characters.
The start + CHUNK_SIZE // 2 lower bound (i.e., 400 characters) prevents a problem: if a paragraph boundary appears at the very beginning of a chunk (say at character 10), cutting there would produce an extremely small chunk, wasting storage space and retrieval resources. Requiring the cut point to be at least in the second half of the chunk ensures every chunk has sufficient content.
Finally, chunks that are too short (fewer than 50 characters) are skipped (miner.py:164). Blank lines and single-line comments aren't worth being standalone retrieval units.
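Putting the pieces together, the windowing, boundary search, overlap, and minimum-size filter described above can be sketched as one standalone function. This is an illustrative reconstruction of the described behavior, not miner.py verbatim:

```python
# Sketch of the paragraph-aware chunker described in the text.
# Constants mirror miner.py's values; the function body is illustrative.
CHUNK_SIZE = 800       # chars per drawer
CHUNK_OVERLAP = 100    # overlap between chunks
MIN_CHUNK_SIZE = 50    # skip tiny chunks

def chunk_text(content: str) -> list[str]:
    chunks = []
    start = 0
    while start < len(content):
        end = min(start + CHUNK_SIZE, len(content))
        if end < len(content):
            # Prefer a paragraph boundary (double newline) in the
            # second half of the window...
            pos = content.rfind("\n\n", start, end)
            if pos > start + CHUNK_SIZE // 2:
                end = pos
            else:
                # ...then a single newline; otherwise hard-cut at 800
                pos = content.rfind("\n", start, end)
                if pos > start + CHUNK_SIZE // 2:
                    end = pos
        piece = content[start:end].strip()
        if len(piece) >= MIN_CHUNK_SIZE:
            chunks.append(piece)
        # Step forward, retaining CHUNK_OVERLAP chars of trailing context
        start = end if end == len(content) else end - CHUNK_OVERLAP
    return chunks
```

On a 500-character paragraph followed by another, this cuts at the double newline rather than at character 800, so neither paragraph is split mid-sentence.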
Conversation Chunking: Q&A Pairs as the Minimum Semantic Unit
Now let's look at conversation file chunking. The chunk_exchanges() function in convo_miner.py (convo_miner.py:52) takes an entirely different approach.
Why Conversations Cannot Use Fixed Windows
Suppose you have this conversation:
> What factors should we consider for our database selection?
Consider three dimensions: first, query patterns -- whether you're primarily OLTP or OLAP;
second, data scale -- projected data volume over the next year; third, team familiarity.
> How does PostgreSQL compare to MySQL?
PostgreSQL is stronger in complex queries and JSON support, while MySQL is more mature
in simple read/write operations and its operational ecosystem. Given your JSON data needs,
I'd recommend PostgreSQL.
If you apply an 800-character fixed window, the possible result is:
[Chunk 1]
> What factors should we consider for our database selection?
Consider three dimensions: first, query patterns -- whether you're primarily OLTP or OLAP;
second, data scale -- projected data volume over the next year; third, team familiarity.
> How does PostgreSQL compare to MySQL?
[Chunk 2]
PostgreSQL is stronger in complex queries and JSON support, while MySQL is more mature
in simple read/write operations and its operational ecosystem. Given your JSON data needs,
I'd recommend PostgreSQL.
The problem is in the last line of Chunk 1: the question "How does PostgreSQL compare to MySQL?" is grouped into Chunk 1, but its answer is in Chunk 2. If a user later searches for "PostgreSQL vs MySQL," Chunk 1 matches the question but doesn't contain the answer, while Chunk 2 contains the answer but lacks the question's context. Neither chunk is complete.
This is why conversations need to be chunked by Q&A pairs. A question and its response are an indivisible semantic unit -- the question defines context, the response provides information. Splitting them apart, both sides lose meaning.
Q&A Pair Chunking Implementation
The _chunk_by_exchange() function (convo_miner.py:66) works as follows:
def _chunk_by_exchange(lines: list) -> list:
    chunks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.strip().startswith(">"):
            # Found a user turn
            user_turn = line.strip()
            i += 1
            # Collect the AI response that follows
            ai_lines = []
            while i < len(lines):
                next_line = lines[i]
                if next_line.strip().startswith(">") or next_line.strip().startswith("---"):
                    break
                if next_line.strip():
                    ai_lines.append(next_line.strip())
                i += 1
            # Merge into a single chunk
            ai_response = " ".join(ai_lines[:8])
            content = f"{user_turn}\n{ai_response}" if ai_response else user_turn
            if len(content.strip()) > MIN_CHUNK_SIZE:
                chunks.append({"content": content, "chunk_index": len(chunks)})
        else:
            i += 1
    return chunks
Several details are worth noting:
Chunk boundaries are driven by the > marker. Upon encountering a > line, it begins collecting a Q&A pair. It continues reading downward until the next > line (the next question) or a --- separator. All non-empty lines in between are the AI's response.
AI responses are truncated to the first 8 lines (convo_miner.py:86). This is an intentional limitation -- " ".join(ai_lines[:8]). Why? Because AI responses can be very long (dozens or even hundreds of lines of code, detailed step-by-step explanations), but for vector retrieval, the first few lines typically contain the core answer. Stuffing an entire lengthy response into a single chunk dilutes the vector's semantic focus.
Empty lines are skipped (convo_miner.py:81). Only lines where next_line.strip() is non-empty are collected. This ensures chunk content is compact with no meaningless whitespace.
--- separators serve as hard boundaries (convo_miner.py:80). If the conversation contains --- separator lines (common in Markdown-formatted conversation logs), they terminate the current Q&A pair collection even if the following content doesn't start with >. This is because --- typically indicates a topic change or conversation segment break.
Fallback: Paragraph Chunking
If the text doesn't have enough > markers (fewer than 3), chunk_exchanges() falls back to _chunk_by_paragraph() (convo_miner.py:102):
def _chunk_by_paragraph(content: str) -> list:
    chunks = []
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    if len(paragraphs) <= 1 and content.count("\n") > 20:
        lines = content.split("\n")
        for i in range(0, len(lines), 25):
            group = "\n".join(lines[i : i + 25]).strip()
            if len(group) > MIN_CHUNK_SIZE:
                chunks.append({"content": group, "chunk_index": len(chunks)})
        return chunks
    for para in paragraphs:
        if len(para) > MIN_CHUNK_SIZE:
            chunks.append({"content": para, "chunk_index": len(chunks)})
    return chunks
This fallback handles two cases:
- Text with paragraph separators (double newline separated): each paragraph becomes a chunk.
- Long text without paragraph separators (more than 20 lines but no double newlines): every 25 lines becomes a chunk.
The number 25 isn't arbitrary -- it roughly corresponds to 800 characters (assuming 30-35 characters per line), keeping it consistent with the project file chunk size.
Parameter Comparison of the Two Strategies
| Parameter | Project Files (miner.py) | Conversation Files (convo_miner.py) |
|---|---|---|
| Chunk unit | Fixed window (800 chars) | Q&A pair (variable size) |
| Overlap | 100 chars | None (no overlap between Q&A pairs) |
| Boundary awareness | Paragraph boundaries (double newline > single newline) | > markers + --- separators |
| Minimum chunk | 50 chars | 30 chars |
| AI response truncation | N/A | First 8 lines |
| Fallback | None (hard cut) | Paragraph chunking / 25-line groups |
Conversation chunking doesn't need overlap because Q&A pairs are naturally separated -- the problem of "sentences straddling chunk boundaries" doesn't exist between Question A's answer and Question B's answer. Each Q&A pair is self-contained.
The conversation chunking minimum threshold (30 characters) is lower than for project files (50 characters) because a brief but meaningful Q&A pair -- such as "> What language?\nPython." -- is only about 20 characters but carries valuable information.
Room Routing: Classification After Chunking
After project file chunking, each chunk needs to be routed to its corresponding "room." The detect_room() function (miner.py:89) uses a three-level priority strategy:
- File path matching: if the file is under the docs/ directory and a room called "docs" exists, route directly to that room
- Filename matching: if the filename contains a room name
- Content keyword scoring: use the room's keyword list to count keyword occurrences in the first 2000 characters of the content
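The three-level priority can be sketched as follows. The function name follows the text, but the body and the keyword lists are illustrative assumptions, not copied from miner.py:

```python
# Illustrative sketch of the three-level room routing described above.
# `rooms` maps room name -> keyword list; its contents are hypothetical.
def detect_room(file_path: str, content: str, rooms: dict) -> str:
    path_lower = file_path.lower()
    # 1. File path match: a directory component equal to a room name wins
    for room in rooms:
        if f"/{room}/" in path_lower or path_lower.startswith(f"{room}/"):
            return room
    # 2. Filename match: room name appears in the filename
    filename = path_lower.rsplit("/", 1)[-1]
    for room in rooms:
        if room in filename:
            return room
    # 3. Keyword scoring over the first 2000 characters of content
    head = content[:2000].lower()
    best_room, best_score = "general", 0
    for room, keywords in rooms.items():
        score = sum(head.count(kw) for kw in keywords)
        if score > best_score:
            best_room, best_score = room, score
    return best_room
```

The ordering matters: a path match is a strong structural signal, a filename match is weaker, and content scoring is the fuzzy last resort.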
Conversation file room routing is different. The detect_convo_room() function (convo_miner.py:194) uses five predefined topic categories:
TOPIC_KEYWORDS = {
"technical": ["code", "python", "function", "bug", ...],
"architecture": ["architecture", "design", "pattern", ...],
"planning": ["plan", "roadmap", "milestone", ...],
"decisions": ["decided", "chose", "switched", ...],
"problems": ["problem", "issue", "broken", ...],
}
These five categories aren't arbitrary -- they correspond to the five types of topics developers most commonly discuss in conversations. If no keywords match, it falls back to "general."
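A minimal sketch of this keyword-based topic routing, using the abbreviated keyword lists shown above (the full lists live in convo_miner.py; the function body here is an assumption):

```python
# Sketch of topic-based room routing for conversation chunks.
# Keyword lists are abbreviated stand-ins for convo_miner.py's full lists.
TOPIC_KEYWORDS = {
    "technical": ["code", "python", "function", "bug"],
    "architecture": ["architecture", "design", "pattern"],
    "planning": ["plan", "roadmap", "milestone"],
    "decisions": ["decided", "chose", "switched"],
    "problems": ["problem", "issue", "broken"],
}

def detect_convo_room(chunk: str) -> str:
    text = chunk.lower()
    # Score each topic by keyword occurrences in the chunk
    scores = {
        topic: sum(text.count(kw) for kw in kws)
        for topic, kws in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to "general" when no keyword matches at all
    return best if scores[best] > 0 else "general"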
Normalization and Chunking Pipeline
The conversation file processing flow is a clear pipeline (convo_miner.py:302-317):
Raw file → normalize() → chunk_exchanges() → store in ChromaDB
normalize() ensures uniform formatting of content entering the chunker (see Chapter 16), and chunk_exchanges() ensures each chunk is a complete semantic unit. Each chunk is stored as a "drawer" in ChromaDB, tagged with metadata such as wing, room, and source_file.
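The pipeline reads as a few lines of glue code. This sketch uses simplified stand-ins for normalize(), chunk_exchanges(), and the ChromaDB collection; only the overall shape of the flow follows the source:

```python
# Hedged sketch of the raw-file -> normalize -> chunk -> store pipeline.
# normalize and chunk_exchanges are simplified stand-ins, and `store` is
# a plain list standing in for a ChromaDB collection.
def normalize(raw: str) -> str:
    # Stand-in: unify line endings and strip trailing whitespace
    return "\n".join(line.rstrip() for line in raw.replace("\r\n", "\n").split("\n"))

def chunk_exchanges(text: str) -> list[dict]:
    # Stand-in: one chunk per "> question" line plus its reply lines
    chunks, current = [], []
    for line in text.split("\n"):
        if line.startswith(">") and current:
            chunks.append({"content": "\n".join(current), "chunk_index": len(chunks)})
            current = []
        if line.strip():
            current.append(line)
    if current:
        chunks.append({"content": "\n".join(current), "chunk_index": len(chunks)})
    return chunks

def mine_conversation(raw: str, store: list) -> int:
    for chunk in chunk_exchanges(normalize(raw)):
        store.append({"document": chunk["content"],
                      "metadata": {"room": "general",
                                   "chunk_index": chunk["chunk_index"]}})
    return len(store)
```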
It's worth noting that the conversation miner supports two extraction modes (convo_miner.py:259): "exchange" (the default Q&A pair chunking) and "general" (a general-purpose extractor that extracts specific types of memories such as decisions, preferences, and milestones). The general extraction mode's chunked results come with a memory_type field that is used directly as the room name, bypassing detect_convo_room()'s topic classification.
Summary
Chunking may look like a simple "split text" operation, but where you split, how large the pieces are, and what constitutes a unit directly determines downstream retrieval quality.
MemPalace's two chunking strategies reflect a fundamental insight: different types of text have different minimum semantic units. The minimum semantic unit of project files is a paragraph -- a block of code, a section of documentation, a configuration block. The minimum semantic unit of conversations is a Q&A pair -- the question defines context, the response provides information, and splitting them apart makes both sides lose meaning.
Key design points:
- Project files: 800-character window + 100-character overlap + paragraph boundary awareness
- Conversation files: Q&A pair chunking delimited by > markers, AI responses truncated to the first 8 lines
- Both strategies share one principle: cut at natural boundaries whenever possible, avoiding semantic fragmentation
- Fallback strategy: when conversations lack > markers, degrade to paragraph chunking so any input can be processed
Chapter 19: MCP Server -- API Design of 19 Tools
Positioning: This chapter dissects how MemPalace exposes its memory palace to AI through 19 MCP tools, and why the response structure of mempalace_status doesn't just return data -- it simultaneously teaches the AI a language and a behavioral protocol.
One Tool Would Suffice, So Why 19?
The classic tension in API design is granularity. Too coarse -- a single omnipotent tool -- and the AI must express all intent through call parameters, turning the prompt into a micro programming language. Too fine -- one tool per operation -- and the AI drowns in choices, with each decision adding to token consumption.
MemPalace chose 19 tools, not because there happen to be 19 operations, but because these 19 tools map to 5 cognitive categories, each corresponding to a role the AI plays in memory interactions. This isn't a feature checklist -- it's a role model.
Let's start from the source code. Open mcp_server.py -- tools are registered in the TOOLS dictionary at lines 441-688. Each tool is a key-value pair: a name mapped to a dictionary containing description, input_schema, and handler fields. The registration approach is utterly plain -- no decorators, no registry, just a Python dictionary:
TOOLS = {
    "mempalace_status": {
        "description": "Palace overview — ...",
        "input_schema": {"type": "object", "properties": {}},
        "handler": tool_status,
    },
    # ... 18 more tools
}
This plainness isn't laziness. The MCP protocol itself is JSON-RPC -- the client sends tools/list, the server returns the tool list; the client sends tools/call, the server executes and returns results. The handle_request function at mcp_server.py:708-718 handles tools/list requests by simply iterating over the TOOLS dictionary to generate the response. The entire protocol interaction is under 30 lines of code. Plain means transparent, and transparent means any developer can understand the entire registration mechanism in five minutes.
Five Groups of Tools, Five Roles
The 19 tools are divided into five groups by cognitive role. First the overview, then we'll break down why they're organized this way.
Read group (7) -- lets the AI perceive the palace's structure and content: status, list_wings, list_rooms, get_taxonomy, search, check_duplicate, get_aaak_spec.
Write group (2) -- lets the AI store things in the palace: add_drawer, delete_drawer.
Knowledge graph group (5) -- lets the AI operate on entity relationships: kg_query, kg_add, kg_invalidate, kg_timeline, kg_stats.
Navigation group (3) -- lets the AI walk and explore the palace: traverse, find_tunnels, graph_stats.
Diary group (2) -- lets the AI maintain cross-session self-awareness: diary_write, diary_read.
graph TB
    MCP[MemPalace MCP Server]
    MCP --> R["Read Group 7<br/>status, list_wings, list_rooms<br/>get_taxonomy, search<br/>check_duplicate, get_aaak_spec"]
    MCP --> W["Write Group 2<br/>add_drawer, delete_drawer"]
    MCP --> K["Knowledge Graph Group 5<br/>kg_query, kg_add<br/>kg_invalidate, kg_timeline<br/>kg_stats"]
    MCP --> N["Navigation Group 3<br/>traverse, find_tunnels<br/>graph_stats"]
    MCP --> D["Diary Group 2<br/>diary_write, diary_read"]
This grouping isn't based on technical implementation. search and traverse both query ChromaDB underneath, but the former is "finding content" while the latter is "walking paths." kg_query and search are both retrieval, but the former retrieves structured relationships while the latter retrieves unstructured text. The grouping criterion is the role the AI plays when calling the tool: is it observing, recording, reasoning, exploring, or reflecting?
Why does the write group have only 2 tools while the read group has 7? Because of a core asymmetry in memory systems: writing is simple (put things in), reading is complex (find things from different angles). A room has only one door to enter, but seven windows to observe from. list_wings is a bird's-eye view, list_rooms is a local zoom, get_taxonomy is the complete map, search is semantic positioning, check_duplicate is a deduplication gate before storage, get_aaak_spec is a language reference manual. Each window corresponds to a different cognitive need.
mempalace_status: One Tool Call, Triple Payload
Among the 19 tools, mempalace_status holds a special position. It doesn't just return data -- it teaches the AI two things: a language (AAAK) and a behavioral protocol (the memory protocol).
Look at the return structure of tool_status at mcp_server.py:63-86:
def tool_status():
    col = _get_collection()
    if not col:
        return _no_palace()
    count = col.count()
    wings = {}
    rooms = {}
    # ... count logic ...
    return {
        "total_drawers": count,
        "wings": wings,
        "rooms": rooms,
        "palace_path": _config.palace_path,
        "protocol": PALACE_PROTOCOL,
        "aaak_dialect": AAAK_SPEC,
    }
The first four fields are standard status data -- total drawer count, wing distribution, room distribution, storage path. The last two fields are key: protocol and aaak_dialect.
First payload: palace overview. total_drawers, wings, rooms tell the AI "how large your memory is, how it's divided, and what each section contains." This is spatial awareness -- after seeing these numbers, the AI knows to search wing_user for personal preferences and wing_code for technical decisions.
Second payload: memory protocol. PALACE_PROTOCOL is a plain-text instruction defined at mcp_server.py:93-100. It specifies the AI's five-step behavioral protocol:
1. ON WAKE-UP: Call mempalace_status to load palace overview + AAAK spec.
2. BEFORE RESPONDING about any person, project, or past event:
   call mempalace_kg_query or mempalace_search FIRST. Never guess — verify.
3. IF UNSURE about a fact: say "let me check" and query the palace.
4. AFTER EACH SESSION: call mempalace_diary_write to record what happened.
5. WHEN FACTS CHANGE: call mempalace_kg_invalidate on the old fact,
   mempalace_kg_add for the new one.
This isn't a suggestion -- it's a protocol. It elevates the AI from "a tool caller that has memory available" to "an agent that actively maintains memory." Item 2 is especially critical -- "Never guess, verify" -- directly countering the core weakness of LLMs: hallucination. When the AI is asked "How old is Max?", the protocol requires it to query the knowledge graph first rather than guessing an answer from training data.
Third payload: AAAK dialect specification. AAAK_SPEC is defined at mcp_server.py:102-119 -- a complete compressed language specification. It teaches the AI three things: entity encoding (ALC=Alice, JOR=Jordan), emotion markers (*warm*=joy, *fierce*=determined), and structural syntax (pipe-delimited, star ratings, hall/wing/room naming conventions).
Why embed the language specification in the status response rather than a separate tool? Because of MCP's call timing. When the AI explicitly calls status, and a palace already exists, the most natural single initialization step can do three jobs at once: understand the palace structure, read the behavioral protocol, and learn the compression language. If the AAAK specification were in another tool, the AI would need two calls to complete the same initialization. The runtime caveat matters: if the palace has not been initialized yet, tool_status() returns _no_palace() rather than protocol + aaak_dialect.
The underlying philosophy of this design is: APIs don't just transmit data -- they transmit behavioral patterns. Traditional APIs assume the caller already knows how to use the data. But when the caller is an LLM with no persistent memory, the API must re-educate the caller in every session. The triple payload of mempalace_status is designed precisely for this purpose.
mempalace_search: Restrained Interface for Semantic Retrieval
mempalace_search is the most frequently used tool, but its interface design is extremely restrained. Look at the schema at mcp_server.py:587-600:
"mempalace_search": {
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string"},
"limit": {"type": "integer"},
"wing": {"type": "string"},
"room": {"type": "string"},
},
"required": ["query"],
},
"handler": tool_search,
}
Four parameters, only query is required. wing and room are optional filters. No sorting options, no pagination, no embedding model selection, no distance metric parameters.
This restraint is deliberate. Its underlying logic comes from the retrieval gains provided by the palace structure: in testing on 22,000+ memories, unfiltered full-corpus search achieved only 60.9% R@10, adding wing filtering jumped to 73.1%, and adding wing + room filtering jumped to 94.8%. In other words, filters are the primary precision lever, not the search algorithm itself.
So the interface design focuses on filters -- making it easy for the AI to express "search in this wing's this room" -- while fully encapsulating the complexity of the search algorithm. The AI doesn't need to know whether the underlying system uses ChromaDB's cosine similarity or Euclidean distance -- it only needs to know "give me a query, an optional scope."
The handler implementation is equally concise. tool_search at mcp_server.py:173-180 has only one effective line of code -- directly calling the search_memories function from searcher.py:
def tool_search(query, limit=5, wing=None, room=None):
    return search_memories(
        query, palace_path=_config.palace_path,
        wing=wing, room=room, n_results=limit,
    )
search_memories (searcher.py:87-142) returns a structured dictionary containing the original text, wing, room, source file, and similarity score for each result. Note that it returns the original text -- "the actual words, never summaries" -- this is MemPalace's core promise. What the AI receives is verbatim memory, not some summary model's reinterpretation of the memory.
mempalace_add_drawer: Writing Is Deduplication
There are only two write tools, but add_drawer's implementation is more complex than its interface suggests. See mcp_server.py:250-287:
def tool_add_drawer(wing, room, content,
                    source_file=None, added_by="mcp"):
    col = _get_collection(create=True)
    # Duplicate check
    dup = tool_check_duplicate(content, threshold=0.9)
    if dup.get("is_duplicate"):
        return {
            "success": False,
            "reason": "duplicate",
            "matches": dup["matches"],
        }
    # ... generate ID, store, return success
Key behavior: before storing, it automatically calls tool_check_duplicate for semantic deduplication. The threshold is 0.9 -- if the palace already contains a memory with cosine similarity exceeding 90% to the new content, the write is rejected and existing matches are returned.
This design removes deduplication responsibility from the AI. Without this mechanism, the AI would need to manually call check_duplicate before every write, and LLMs frequently "forget" to perform such defensive operations. Built-in deduplication means that even if the AI tries to store the same content repeatedly -- for example, being told the same thing in different sessions -- the palace won't bloat.
The drawer ID generation approach (mcp_server.py:267) is also worth noting: it uses an MD5 hash of the first 100 characters of content plus the current timestamp, taking the first 16 hex digits and prepending wing and room prefixes. This means the same content stored at different times produces different IDs -- but with a deduplication threshold of 0.9, semantically identical content is already intercepted. The ID naming convention (drawer_wing_room_hash) also makes debugging intuitive: seeing an ID immediately tells you which wing and room it belongs to.
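The described ID scheme is a few lines of standard-library code. This is a sketch reconstructed from the description, not the exact mcp_server.py implementation:

```python
import hashlib
import time

# Sketch of the drawer ID scheme described above: MD5 of the first 100
# characters of content plus the current timestamp, first 16 hex digits,
# prefixed with wing and room. make_drawer_id is an illustrative name.
def make_drawer_id(wing: str, room: str, content: str) -> str:
    seed = content[:100] + str(time.time())
    digest = hashlib.md5(seed.encode("utf-8")).hexdigest()[:16]
    return f"drawer_{wing}_{room}_{digest}"
```

The timestamp in the seed is why identical content stored at different times yields different IDs; the 0.9 deduplication threshold is what actually prevents duplicate storage.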
mempalace_kg_query: Timeline of Structured Memory
The five knowledge graph tools are the fundamental difference between MemPalace and pure vector retrieval systems. Among them, kg_query is the most frequently used. See mcp_server.py:309-312:
def tool_kg_query(entity, as_of=None, direction="both"):
    results = _kg.query_entity(
        entity, as_of=as_of, direction=direction)
    return {"entity": entity, "as_of": as_of,
            "facts": results, "count": len(results)}
Three parameters -- entity name, point in time, direction -- correspond to three query modes:
kg_query("Max")-- all of Max's relationships, past and present.kg_query("Max", as_of="2026-01-15")-- a snapshot of Max's relationships as of January 15, 2026.kg_query("Max", direction="incoming")-- who has relationships with Max.
The underlying implementation of the as_of parameter is in knowledge_graph.py:199-203: it uses the SQL condition valid_from <= ? AND (valid_to IS NULL OR valid_to >= ?) to return only facts valid at the specified date. This means when the AI is asked "What was Max working on last year?", it can see facts that were valid last year rather than today's facts.
The time dimension works in concert with kg_invalidate. When a fact is no longer true -- say Max has left the swimming team -- the AI calls kg_invalidate("Max", "does", "swimming", ended="2026-03-01"). The fact isn't deleted but is marked with an end date. Historical queries can still see it, but current queries won't return it.
This "soft delete" design reflects a deep cognitive insight: memory is not a database -- the termination of a fact is just as important as its existence. Deleting a memory means pretending it never happened. Marking its end date means acknowledging the passage of time. The AI needs the latter capability to correctly answer the difference between "What did Max used to do?" and "What does Max do now?"
Navigation Group: From "Finding Things" to "Walking Paths"
The three navigation tools -- traverse, find_tunnels, graph_stats -- differ fundamentally from the read group's search. search is "I know what I'm looking for, help me find it." Navigation is "I don't know what I'm looking for, take me for a walk."
mempalace_traverse (mcp_server.py:553-569) says it most clearly in its description: "Like following a thread through the palace: start at 'chromadb-setup' in wing_code, discover it connects to wing_myproject (planning) and wing_user (feelings about it)."
Its implementation delegates to the traverse function in palace_graph.py. The underlying logic builds a room-level graph from ChromaDB's metadata -- when the same room name appears in different wings, a "tunnel" forms between them. The AI starts from one room, follows a tunnel to a same-named room in another wing, then sees what other rooms in that wing relate to the starting point.
find_tunnels is more direct -- given two wings (or none, to see bridges between all wings), it returns the rooms that connect them. When the AI needs to understand "what's the relationship between technical decisions and team dynamics," it can call find_tunnels(wing_a="wing_code", wing_b="wing_team") and receive a set of shared room names -- these rooms are the topics where the two domains intersect.
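Since a tunnel is simply a room name shared by two wings, the core of find_tunnels reduces to a set intersection over a wing-to-rooms mapping. This sketch and its data are illustrative, not palace_graph.py's actual code:

```python
# Sketch of tunnel discovery: rooms sharing a name across two wings
# connect them. rooms_by_wing maps wing name -> iterable of room names.
def find_tunnels(rooms_by_wing: dict, wing_a: str, wing_b: str) -> set:
    return set(rooms_by_wing.get(wing_a, [])) & set(rooms_by_wing.get(wing_b, []))
```

With, say, wing_code containing rooms {auth, planning, testing} and wing_team containing {planning, hiring}, the shared room "planning" is the tunnel where the two domains intersect.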
The existence of these three navigation tools explains why the total tool count is 19 rather than 14 or 15. With only read and write, the AI can only do precise retrieval. Adding the knowledge graph, the AI can do structured reasoning. Adding navigation, the AI can do open-ended exploration. The 19 tools cover the complete spectrum of memory interaction: from storage to retrieval to reasoning to exploration to reflection.
Protocol Layer: Minimalist JSON-RPC Implementation
Finally, let's look at the MCP protocol layer itself. The main function at mcp_server.py:746-768 is the entire server's entry point:
def main():
    logger.info("MemPalace MCP Server starting...")
    while True:
        line = sys.stdin.readline()
        if not line:
            break
        request = json.loads(line.strip())
        response = handle_request(request)
        if response is not None:
            sys.stdout.write(json.dumps(response) + "\n")
            sys.stdout.flush()
Read from stdin, parse JSON, dispatch, write to stdout. No HTTP, no WebSocket, no framework. The MCP protocol runs over the stdio channel -- the most primitive form of inter-process communication.
This choice has two consequences. First, startup cost is nearly zero -- no port binding needed, no TLS certificates, no service discovery. claude mcp add mempalace -- python -m mempalace.mcp_server completes registration in a single command. Second, the security model is extremely simple -- the MCP server is a subprocess of Claude Code, inheriting the parent process's permissions, requiring no additional authentication mechanism.
handle_request (mcp_server.py:691-743) handles four methods: initialize (handshake), notifications/initialized (confirmation), tools/list (tool catalog), and tools/call (tool invocation). During tool invocation, it looks up the handler from the TOOLS dictionary, unpacks parameters with **tool_args, calls the function, and serializes the return value to JSON. The entire dispatch logic is under 50 lines.
This minimalist implementation isn't a technical limitation -- it's a design philosophy: the protocol layer should be transparent -- all complexity should be in the tools' semantic design, not in the transport mechanism. The grouping logic of 19 tools, the triple payload of status, the built-in deduplication of add_drawer, the temporal filtering of kg_query -- these are where design effort is worth investing. The protocol layer just needs to work.
The Invisible Aspects of Design
Returning to the opening question: why 19 tools?
The answer isn't in the number itself. 19 tools are the natural result of the following constraints:
Read must outnumber write. Because memory's value lies in being retrieved, and retrieval has multiple granularities -- global overview, wing-level listing, room-level listing, semantic search, duplicate checking, language specification reference. Each granularity serves a different cognitive moment.
Knowledge graph must be independent from vector retrieval. Because vector retrieval answers "what content is similar to this query," while the knowledge graph answers "what relationships does this entity have with whom, and at what time." The former is fuzzy matching; the latter is precise reasoning. The AI needs both capabilities.
Navigation must be independent from search. Because search assumes the user knows what they're looking for, while navigation assumes the user wants to explore the unknown. traverse and find_tunnels let the AI discover connections, not just retrieve known ones.
Diary must exist. Because without diary, the AI is merely a tool that searches other people's memories. With diary, it's an agent with its own observation history. The gap between the two isn't a feature gap -- it's a role gap.
19 is the minimum complete set of these constraints. Not 18, because that would mean cutting a cognitive capability. Not 20, because there's no 20th need that can't be covered by the first 19.
Every API ultimately encodes a worldview. MemPalace's 19 tools encode the worldview that: AI doesn't just need to store and retrieve memories -- it needs to reason within structured relationships, explore spatial topology, trace along timelines, and reflect in private diaries. These five capabilities together constitute complete memory interaction.
Chapter 20: Specialist Agent System
Positioning: This chapter analyzes how MemPalace uses the palace's spatial structure -- rather than configuration-file bloat -- to host an unlimited number of specialist agents, and why it models each agent's memory as wing + diary. The AAAK diary format is encouraged in the README and tool descriptions; what the current MCP runtime actually implements is the storage structure of one wing and one diary room per agent.
The Configuration Dilemma of Specialist Agents
Suppose you have 50 AI agents, each responsible for a specialized domain. One reviews code quality, one tracks architectural decisions, one monitors operational incidents, one records product requirements, one watches for security vulnerabilities. In traditional agent frameworks, each agent needs independent configuration -- system prompts, memory storage, state management, permission scopes. 50 agents means 50 configuration files, or an ever-growing central configuration.
Letta (formerly MemGPT) chose the centralized path: each agent has independent memory blocks, independent system prompts, independent core memory and archival memory. All of these are stored in the cloud and managed via API. This is clean, but costs scale linearly -- the free tier gets 1 agent, the developer tier is $20/month for 10 agents, the business tier is $200/month for 100 agents. Agent count is directly tied to wallet depth.
MemPalace chose an entirely different path: agents don't live in configuration files -- agents live in the palace.
One Agent = One Wing + One Diary
Return to the diary tools in mcp_server.py. The first three lines of tool_diary_write (lines 349-392) reveal the entire architecture:
def tool_diary_write(agent_name, entry, topic="general"):
    wing = f"wing_{agent_name.lower().replace(' ', '_')}"
    room = "diary"
    col = _get_collection(create=True)
When an agent named "reviewer" writes a diary entry, its entry is stored in wing_reviewer/diary. When "architect" writes a diary entry, it goes into wing_architect/diary. An agent's identity is determined by its wing name, and an agent's memory is carried by its diary room. No additional configuration file is needed to "register" an agent -- the wing is automatically created on the first call to diary_write.
The metadata structure (mcp_server.py:368-380) further reinforces this naturalness:
col.add(
    ids=[entry_id],
    documents=[entry],
    metadatas=[{
        "wing": wing,
        "room": room,
        "hall": "hall_diary",
        "topic": topic,
        "type": "diary_entry",
        "agent": agent_name,
        "filed_at": now.isoformat(),
        "date": now.strftime("%Y-%m-%d"),
    }],
)
Each diary entry carries complete spatial coordinates -- wing (which agent), room (diary), hall (hall_diary), topic (topic tag) -- plus temporal coordinates (filed_at, date) and identity markers (agent). That is enough to let diary entries flow into the palace's existing infrastructure: search can find them, list_rooms can count them, and graph-building logic will still see diary as an ordinary room. More cautiously stated, the current code does not provide a diary-specific topic filter or a higher-level agent-orchestration layer; at runtime these are still ordinary drawers with agent-flavored metadata.
Agents aren't add-ons external to the palace. Agents are residents of the palace.
AAAK Diary: Compressed Self-Awareness
When agents write diary entries, the interface contract encourages AAAK. Look at the diary_write tool's description (mcp_server.py:649):
Write to your personal agent diary in AAAK format. Your observations,
thoughts, what you worked on, what matters. Write in AAAK for
compression — e.g. 'SESSION:2026-04-04|built.palace.graph+diary.tools
|ALC.req:agent.diaries.in.aaak|★★★★'
The tool description directly demonstrates the AAAK diary format. A typical agent diary entry looks like this:
PR#42|auth.bypass.found|missing.middleware.check|pattern:3rd.time.this.quarter|★★★★
This single line compresses the following information: during PR #42 review, an authentication bypass vulnerability was found, caused by a missing middleware check, this is the third occurrence of the same pattern this quarter, importance four stars. In natural language, this would take at least three lines. In AAAK, one line does the job.
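The recoverability of that structure is mechanical. As a minimal sketch (not part of MemPalace; field roles here are inferred purely by position and prefix), the structured parts of a pipe-delimited line can be pulled back out with ordinary string splitting:

```python
def parse_aaak(line):
    """Split an AAAK diary line into pipe-delimited fields and pull out
    the parts with a recognizable prefix (illustrative, not a real schema)."""
    fields = line.split("|")
    stars = next((f for f in fields if f.startswith("★")), None)
    patterns = [f for f in fields if f.startswith("pattern:")]
    return {"fields": fields, "importance": stars, "patterns": patterns}

entry = "PR#42|auth.bypass.found|missing.middleware.check|pattern:3rd.time.this.quarter|★★★★"
parsed = parse_aaak(entry)
# parsed["importance"] == "★★★★"
# parsed["patterns"] == ["pattern:3rd.time.this.quarter"]
```

No model, no decoder: the same property that makes AAAK cheap to write makes it trivial to inspect.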
This is exactly where "interface contract" and "runtime enforcement" must be separated. tool_diary_write does not validate that the input is AAAK, nor does it automatically compress natural language into AAAK; it simply stores whatever string the caller passes in. So the token-efficiency benefit is real, but only when the caller chooses to follow the AAAK convention.
That means the more accurate claim is: if a code review agent consistently writes its observations in AAAK, then reading the latest 10 diary entries can indeed restore recent work patterns with a very small token budget. If it writes plain English, diary_read still works -- it just costs more context.
AAAK diary's pipe-delimited syntax is still especially useful here. pattern:3rd.time.this.quarter records not just a fact but a trend. When the agent later reviews another authentication-related PR, reading these diary entries can bring that pattern back into the current session. A diary is not a log; it is a compressed encoding of a learning curve. In the current implementation, though, that compression happens because the caller voluntarily writes AAAK, not because the MCP server enforces or performs it.
50 Agents, One Line of Configuration
The most counterintuitive feature of MemPalace's agent system is: no matter how many agents you have, your CLAUDE.md (or any system configuration file) doesn't need to change. The configuration shown in the README is:
You have MemPalace agents. Run mempalace_list_agents to see them.
One line. Not "you have an agent called reviewer, it focuses on code quality, its system prompt is..." Instead, it tells the AI: you have agents, go look in the palace to see which ones.
The README then sketches a fuller runtime-discovery mechanism, with agent definition files stored in the ~/.mempalace/agents/ directory:
~/.mempalace/agents/
├── reviewer.json # code quality, pattern recognition, bug tracking
├── architect.json # design decisions, trade-off analysis, architecture evolution
└── ops.json # deployment, incident response, infrastructure
That tells a clear product-direction story: agent descriptors live in a local directory, top-level configuration stays one line long, and runtime discovery fills in the rest. But if you compare this strictly against the current mcp_server.py, one important truth has to be added: the repository does not implement mempalace_list_agents, nor does it load ~/.mempalace/agents/*.json. What is really shipped today is the minimum storage structure provided by diary_write / diary_read; the README's agent directory and discovery flow are better read as a proposed higher-level workflow.
So "adding the 51st agent" needs to be split into two layers. At the current MCP layer, it simply means calling diary_write with a new agent_name; the system will naturally start writing into wing_<agent>/diary with no schema migration, no extra instance, and no central registry. In the fuller README experience, it would additionally involve an agent JSON file and a runtime discovery step.
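At the MCP layer, "adding an agent" really is just a string derivation. The wing-name rule that tool_diary_read uses (shown later in this chapter) can be isolated in one line; the agent names below are made up:

```python
def wing_for(agent_name):
    """Derive a wing name the way tool_diary_read does:
    lowercase, spaces to underscores, "wing_" prefix."""
    return f"wing_{agent_name.lower().replace(' ', '_')}"

wing_for("reviewer")      # -> "wing_reviewer"
wing_for("Data Janitor")  # -> "wing_data_janitor"
```

The 51st agent, whatever its name, lands in a valid wing without touching any registry.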
Compare this to Letta's model: each agent has independent core memory (always-loaded key facts), recall memory (searchable history), and archival memory (long-term storage). All of these are managed via REST API and stored in the cloud. Adding an agent means creating an agent instance, configuring its memory blocks, setting up its system prompt, and managing its API keys. 50 agents means doing this 50 times, plus ongoing monthly fees.
MemPalace compresses all of this into one wing and one diary. Core memory is the most recent N diary entries. Recall memory is semantic search over the diary room. Archival memory is the other rooms in the wing. A three-layer memory architecture is implicit in the palace's spatial structure, requiring no explicit management.
Tunnels Between Agents: Natural Emergence of Shared Memory
An unexpected advantage of the palace architecture is knowledge connections between agents.
When the reviewer agent records auth.bypass.found|missing.middleware.check in wing_reviewer/diary, and the architect agent records auth.migration.decision|clerk>auth0|middleware.layer.critical in wing_architect/diary -- each writes in its own wing, without interference. But mempalace_search("middleware") returns both records. mempalace_find_tunnels("wing_reviewer", "wing_architect") discovers they share a "diary" room (albeit with different content).
The current implementation boundary matters here too. searcher.py supports only wing and room filters; it does not support diary-specific filtering like topic=auth, and traverse aggregates by room name rather than topic. So MemPalace already provides "multiple agents can keep memory inside the same palace and be searched together" at one level, but it does not yet provide a finer-grained agent-topic orchestration layer.
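The exact implementation of mempalace_find_tunnels isn't reproduced here, but the idea the chapter relies on -- a tunnel exists wherever two wings share a room name -- reduces to a set intersection. A sketch under that assumption, with invented room lists:

```python
def find_tunnels(rooms_by_wing, wing_a, wing_b):
    """Rooms present in both wings; each shared room name is a 'tunnel'."""
    return sorted(set(rooms_by_wing[wing_a]) & set(rooms_by_wing[wing_b]))

rooms = {
    "wing_reviewer": ["diary", "auth-notes"],
    "wing_architect": ["diary", "decisions"],
}
find_tunnels(rooms, "wing_reviewer", "wing_architect")  # -> ["diary"]
```

The shared "diary" room is a tunnel by name only; the contents on each side remain per-agent.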
Diary Reading: Recent Self-Review
The implementation of tool_diary_read (mcp_server.py:395-436) reveals the diary system's final design detail:
def tool_diary_read(agent_name, last_n=10):
    wing = f"wing_{agent_name.lower().replace(' ', '_')}"
    col = _get_collection()
    results = col.get(
        where={"$and": [{"wing": wing}, {"room": "diary"}]},
        include=["documents", "metadatas"],
    )
    # ... sort by timestamp, return latest N
It uses $and conditions to precisely locate the agent's diary -- wing matches the agent name, room is fixed as "diary." Then it sorts by timestamp in descending order and returns the most recent N entries.
The default value last_n=10 is a considered choice. Too few (say 3), and the agent loses trend awareness -- it can't see "this problem keeps recurring." Too many (say 50), and a recent self-review starts to consume unnecessary context budget. There is no need to pretend the code contains an exact token manager here; what the source really does is simpler: return the latest N entries and let the caller decide how to consume them.
The total field in the return structure tells the agent how long its complete diary is:
return {
    "agent": agent_name,
    "entries": entries,
    "total": len(results["ids"]),
    "showing": len(entries),
}
An agent with 200 historical diary entries but showing only 10 knows it has rich history but currently sees only the most recent slice. If it needs earlier memories, it can do semantic search within its own wing via mempalace_search. The diary is a fast lane for recent review; search is the backup path for deep recall.
Three Layers Deep: The Meaning Stack of Agent Architecture
Stacking up the three layers of what an agent means:
First layer: storage. An agent is a wing. A wing is a metadata tag in ChromaDB, not a separate database. Adding an agent means adding a tag value -- the system's complexity doesn't increase. Storage grows linearly from 0 to N agents, but the infrastructure cost function's slope is zero: no new databases, no new processes, no new configuration.
Second layer: cognition. If the caller follows the tool description and writes diary entries in AAAK, then the diary records not just facts but patterns (pattern:3rd.time), importance assessments (star ratings), and emotion markers (*fierce*, *raw*). When an agent reads those entries in a new session, it is not merely recalling what happened -- it is rebuilding its understanding of the domain. A reviewer agent that has reviewed 200 PRs, after reading 10 recent compressed diary entries, has sharper code quality perception than a fresh AI with no history.
Third layer: ecosystem. Multiple agents accumulate domain expertise in the same palace, and their memories are connected through the palace's search and navigation infrastructure. Bug patterns found by the reviewer may relate to the architect's design decisions; incidents recorded by ops may corroborate the reviewer's code quality concerns. These connections don't need to be manually established -- they emerge naturally through shared semantic space and namespace.
These three layers together answer a bigger question: where should AI agent memory live?
In CLAUDE.md? That's configuration bloat -- every additional agent means another section in the configuration file. In separate databases? That's infrastructure bloat -- every additional agent means another storage instance to manage. In cloud services? That's cost bloat -- every additional agent means another line item on the monthly bill.
In one wing of the palace? That's a tag. The palace already has search, navigation, knowledge graph, and compression. The agent is simply a new consumer of these existing capabilities. It doesn't add infrastructure, doesn't add configuration, doesn't add monthly fees. It only adds memory -- and memory is the very reason the palace exists.
Chapter 21: Local Model Integration
Positioning: This chapter explains how MemPalace can run for long periods in an offline environment once local dependencies and default embedding assets are prepared -- from ChromaDB to local models to AAAK compression -- and why the entire stack was designed from day one with "no continuous network requirement" as a hard constraint rather than an optional feature.
Offline Is Not a Degraded Mode
Most AI memory systems treat offline support as a degraded mode: the cloud version offers full functionality, while local operation trades away some features, some performance, and some fees. Mem0's core is a cloud API, with self-hosting available only as an enterprise option. Zep's knowledge graph runs on Neo4j, which can be set up locally but is recommended to run as a managed cloud instance.
MemPalace's design direction is the complete opposite: the main path is local, and cloud enhancement is a side path. ChromaDB is an embedded vector database with data stored on the local filesystem. The knowledge graph uses SQLite, also a local file. AAAK compression is pure string manipulation, dependent on no external service. The MCP server runs over the stdio channel, involving no network. More precisely: once local dependencies and default embedding assets are prepared, you can store, search, wake up, and query the knowledge graph on a disconnected laptop; only benchmark paths like Haiku / Sonnet rerank add a cloud model.
This design isn't technical purism. It stems from a judgment about the nature of memory data: personal memory is one of the most sensitive data types, and it should not require users to trust any third party. Your technical decisions, team dynamics, personal preferences, project progress -- the aggregate of this information is more sensitive than any single document because it paints a complete portrait of your work. Hosting this portrait on someone else's server requires a very strong reason. And "convenience" is not a strong enough reason.
The topic of this chapter isn't "how to install and configure locally" -- that's documentation's job. The topic is: when the entire stack is local, what does the integration path between AI and memory look like?
Path One: The Wake-Up Command
Look at the cmd_wakeup implementation at cli.py:107-118:
def cmd_wakeup(args):
    """Show L0 (identity) + L1 (essential story)
    — the wake-up context."""
    from .layers import MemoryStack
    palace_path = (os.path.expanduser(args.palace)
                   if args.palace
                   else MempalaceConfig().palace_path)
    stack = MemoryStack(palace_path=palace_path)
    text = stack.wake_up(wing=args.wing)
    tokens = len(text) // 4
    print(f"Wake-up text (~{tokens} tokens):")
    print("=" * 50)
    print(text)
It does one simple thing: extracts L0 (identity) and L1 (key facts) from the palace and outputs them to the terminal. The user copies this text into the local model's system prompt, and the model now possesses the palace's core memory.
Command-line usage:
mempalace wake-up > context.txt
# Paste contents of context.txt into the local model's system prompt
mempalace wake-up --wing driftwood > context.txt
# Project-specific wake-up context
The internal logic of MemoryStack.wake_up() (layers.py:380-399) has two steps. Step one loads L0: reads ~/.mempalace/identity.txt -- a user-written plain text file defining the AI's identity. Step two generates L1: pulls the 15 most important memories from ChromaDB (sorted by importance), groups them by room, truncates to a 3200-character limit, and formats them into compact text blocks.
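That two-step shape can be sketched in a few lines. This is a simplification, not the code from layers.py: the function name, grouping format, and sort key are illustrative, and only the constants (top 15 memories, 3200-character cap) come from the description above.

```python
def build_wakeup(identity_text, memories, top_n=15, char_limit=3200):
    """L0 = identity text; L1 = top-N memories by importance,
    grouped by room and truncated to the character cap."""
    by_room = {}
    ranked = sorted(memories, key=lambda m: m["importance"], reverse=True)[:top_n]
    for m in ranked:
        by_room.setdefault(m["room"], []).append(m["text"])
    l1 = "\n".join(f"[{room}] " + " | ".join(texts)
                   for room, texts in by_room.items())
    return identity_text + "\n\n" + l1[:char_limit]
```

The hard cap matters more than the grouping: it guarantees the wake-up text never silently balloons past the budget a local model can afford.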
By the current source baseline, that output is typically ~600-900 tokens, and the CLI estimates it with a simple len(text) // 4 heuristic before printing. The ~170 token number that appears in the README describes a more aggressive later path: rewrite L1 into AAAK and then use that smaller representation for wake-up. In other words, "usable with local models" and "wake-up is only 170 tokens" are two different claims; the former is already true, the latter is not yet the default command path.
Why the name "wake-up" instead of "load-context" or "get-summary"? Because the semantics of this operation aren't "fetch data" but "wake up an agent with memory." When the local model loads this L0 + L1 text, it transforms from a generic model that knows nothing about the user into a personal assistant that knows who the user is, what projects they're working on, and what they care about. This is identity injection, not data transfer.
Path Two: Python API
The wake-up command suits manual workflows -- the user switches back and forth between the terminal and a local model. But if you're building an automated pipeline -- say a locally running agent framework, or a custom chat interface -- you need a programming interface.
The search_memories function at searcher.py:87-142 is the core of this interface:
from mempalace.searcher import search_memories

results = search_memories(
    "auth decisions",
    palace_path="~/.mempalace/palace",
    wing="driftwood",
)
# results = {
#     "query": "auth decisions",
#     "filters": {"wing": "driftwood", "room": None},
#     "results": [
#         {"text": "...", "wing": "...", "room": "...",
#          "source_file": "...", "similarity": 0.87},
#         ...
#     ]
# }
It returns a dictionary rather than printing to the terminal. The results list in the dictionary contains the original text, spatial coordinates, source file, and similarity score for each matching memory. The caller takes this dictionary and injects the memory text into the prompt sent to the local model.
A typical integration pattern:
from mempalace.searcher import search_memories
from mempalace.layers import MemoryStack
# 1. Load wake-up context
stack = MemoryStack()
wakeup = stack.wake_up()
# 2. Search for relevant memories on demand
results = search_memories("auth migration timeline")
memories = "\n".join(r["text"] for r in results["results"])
# 3. Assemble prompt, send to local model
prompt = f"""## Your Memory
{wakeup}
## Relevant Memories
{memories}
## User Question
Why did we choose Clerk over Auth0?
"""
# response = local_model.generate(prompt)
This code doesn't depend on any network request. MemoryStack reads data from local ChromaDB, search_memories does vector retrieval locally, prompt assembly is pure string concatenation, and local_model.generate calls locally running model inference. The entire chain completes end-to-end on the local machine.
Note that search_memories and the MCP server's tool_search actually call the same function (mcp_server.py:173-180). The MCP path and the Python API path converge at the same retrieval engine underneath. This means memories found through MCP with Claude and memories injected into local models through the Python API come from exactly the same data source and retrieval logic. There's no such thing as "MCP-version memories are better."
Anatomy of an Offline-Capable Stack
Putting all components together, here's what an offline-capable MemPalace stack looks like after cold-start preparation:
Storage layer: ChromaDB (embedded) + SQLite. ChromaDB stores vector embeddings and documents on the local filesystem, defaulting to ~/.mempalace/palace. What the repository directly proves is that MemPalace does not configure an external embedding service by default; it relies on ChromaDB's local embedding path. After that initial asset-preparation step, the path can keep running offline. The knowledge graph uses SQLite, stored at ~/.mempalace/knowledge_graph.sqlite3. Both databases combined have minimal disk requirements -- a palace with 22,000 memories, including all data and indexes, is approximately 200-300MB.
Compression layer: AAAK dialect. Purely rule-driven text compression, dependent on no model. Entity names are replaced with three-letter codes, structured into pipe-delimited format, emotions marked with asterisks. A 30x compression ratio means 3000 tokens of natural language memory can be compressed to 100 tokens. This is especially important for local models with small context windows -- a 4K context model, after deducting system prompt and user input, may only have 2K tokens left for memory. AAAK lets those 2K tokens hold what would otherwise require 60K tokens.
Interface layer: CLI + Python API. mempalace wake-up outputs wake-up text, mempalace search outputs search results. Both commands output plain text that can be injected into any model via piping, redirection, or copy-paste. The Python API provides programmatic access, returning structured data for automated pipelines.
Inference layer: the user's chosen local model. MemPalace isn't tied to any specific model. Its output is text -- any model that can read text can consume it. This isn't a stance of technical neutrality but a natural consequence of architectural constraints: when your output format is plain text, your consumer can be any text processor, whether a 70B-parameter Llama or a 7B Mistral, whether local inference or an API call.
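Of these four layers, the compression layer's "pure string manipulation" claim is the easiest to make concrete. The real AAAK compressor is more involved; this toy version -- with a made-up code table and naive clause splitting -- only illustrates why no model or network call is involved:

```python
CODES = {"Alice": "ALC", "authentication": "auth"}  # hypothetical code table

def compress(text, codes=CODES):
    """Rule-driven toy compression: substitute entity codes,
    dot-join words within a clause, pipe-join clauses."""
    for name, code in codes.items():
        text = text.replace(name, code)
    clauses = [c.strip().replace(" ", ".") for c in text.split(",")]
    return "|".join(clauses)

compress("Alice requested, agent diaries in aaak")
# -> "ALC.requested|agent.diaries.in.aaak"
```

Everything here is deterministic string work, which is exactly what lets the layer run identically on a disconnected laptop and in a cloud pipeline.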
Why This Design: Two Key Decisions
Looking back at this offline stack, two design decisions deserve deeper analysis.
Decision One: Text as Interface, Not Tool Calls
Under the MCP path, AI accesses memory through tool calls -- structured input parameters, structured JSON returns. But under the local model path, the interface degrades to plain text. Wake-up output is text, search output is text, AAAK is text.
This appears to be a downgrade -- from a structured API to string copy-paste. But in reality, text is the most universally compatible interface format. JSON APIs require the consumer to understand the schema. Tool calls require the consumer to implement the MCP protocol. Text only requires the consumer to be able to read.
The deeper significance of this choice is: it doesn't require local models to have any "special capabilities." No function calling support needed, no tool use training needed, no JSON mode needed. A 7B model that's only been through basic text generation training can consume the current wake-up text directly. Today that usually costs about 600-900 tokens; if the AAAK wake-up path lands later, the barrier falls further.
Decision Two: AAAK Is a Plain Text Protocol, Not an Encoding Format
There's an easily overlooked key property in AAAK's design: it doesn't need a decoder.
Compare other compression approaches. If you compress memory text with gzip, you get extremely high compression ratios, but LLMs can't directly read gzip binary. If you use custom token encoding -- say mapping "Alice" to a special token -- you need to modify the model's vocabulary or do a decoding pass before inference.
AAAK needs neither. ALC=Alice is a readable mapping. | is a visible delimiter. ★★★★ is an intuitively understandable rating. Any LLM -- regardless of its training data, vocabulary, or inference framework -- can directly read AAAK text and correctly understand its meaning.
This is the foundational assumption that makes the entire local stack viable. If AAAK required a decoding step, a preprocessor would need to be inserted into the local model's inference pipeline. A preprocessor means additional code, additional dependencies, additional failure points. Plain-text AAAK eliminates this layer -- memory flows from storage to consumption as an end-to-end plain text stream with no conversion steps in between.
Trade-offs Between the Two Paths
The wake-up path and the Python API path aren't alternatives but complements. They serve different use cases.
The wake-up path suits interactive use. The user sits at the terminal, starts a new conversation, runs mempalace wake-up, pastes the output into the model's context, then begins the conversation. The entire process takes about 10 seconds, with an additional 600-900 tokens consumed under the current implementation. Suitable for everyday Q&A, brainstorming, and code review. Its advantage is zero integration cost -- no code to write, no configuration to change, no pipeline to build. The README's lighter ~170 token number belongs to the next optimization stage of the same workflow.
The Python API path suits automated pipelines. A developer builds a custom agent framework -- perhaps a LangChain-based workflow, a custom CLI tool, or an IDE plugin -- using search_memories to automatically retrieve relevant memories before each conversation and inject them into the prompt. Additional token consumption depends on the number and length of search results, typically 500-2000 tokens. Suitable for scenarios requiring deep memory integration -- project retrospectives, decision tracing, knowledge base queries.
Both paths share the same palace. Memories seen in wake-up and memories retrieved via the API come from the same ChromaDB instance. Switching paths requires no data migration, no re-indexing, no format conversion. The palace is the single source of truth -- the access method is interchangeable.
The Cost and Return of Going Offline
To be honest, running completely offline has costs.
The cost of embedding quality. ChromaDB's default all-MiniLM-L6-v2 is a small embedding model. Its semantic understanding capability doesn't match OpenAI's text-embedding-3-large or Cohere's embed-v3. In extreme semantic matching scenarios -- such as searching for a memory containing "Auth0's pricing became unsustainable when users exceeded ten thousand" with the query "why did we abandon the old authentication system" -- the small model might miss what a large model wouldn't. MemPalace compensates for this gap through palace structure filtering: when you tell the search "look in the auth-migration room of wing_driftwood," the search space shrinks to a few dozen memories, and the small model's accuracy within this range is comparable to the large model's. This is also why the palace structure delivers a 34% retrieval improvement -- structure compensates for the model.
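The compensation mechanism is mechanical: metadata filtering shrinks the candidate pool before any vectors are compared. A toy sketch (similarity reduced to a dot product, data invented) of the filter-then-rank order of operations:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_search(memories, query_vec, wing=None, room=None, k=5):
    """Structure first, semantics second: the metadata filter shrinks
    the pool before (small-model) vector similarity is applied."""
    pool = [m for m in memories
            if (wing is None or m["wing"] == wing)
            and (room is None or m["room"] == room)]
    return sorted(pool, key=lambda m: dot(query_vec, m["vec"]), reverse=True)[:k]

memories = [
    {"wing": "wing_driftwood", "room": "auth-migration",
     "text": "dropped Auth0 on cost", "vec": [0.9, 0.1]},
    {"wing": "wing_driftwood", "room": "frontend",
     "text": "CSS grid refactor", "vec": [0.8, 0.2]},
    {"wing": "wing_ops", "room": "auth-migration",
     "text": "rotated keys", "vec": [0.7, 0.3]},
]
hits = filtered_search(memories, [1.0, 0.0],
                       wing="wing_driftwood", room="auth-migration")
# only one candidate survives the structural filter
```

Within a pool of a few dozen memories, even a small embedding model has little room to mis-rank.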
The cost of reasoning capability. Local models' reasoning capabilities are typically weaker than cloud-based large models. A 7B-parameter model may not be able to precisely understand pattern markers in AAAK diary entries, correctly infer temporal relationships, or judge between multiple contradictory memories the way Claude can. But MemPalace's design philosophy is: let the storage layer do storage's job, and let the reasoning layer do reasoning's job. If memories are correctly retrieved and presented to the model, even if the model's reasoning capability is limited, it's at least reasoning on a correct factual basis. This is far better than a response with strong reasoning ability but based on hallucinations.
The returns are certain. Privacy protection -- your memories never leave your machine. Zero operating cost -- aside from electricity, there are no monthly fees. Unlimited availability -- no dependency on network connectivity, no API rate limiting, no loss of memory due to service outages. And a deeper return: sovereignty. Your memory system is unaffected by any third party's pricing decisions, privacy policy changes, or service shutdowns. It runs on your hard drive, with your chosen model, outputting text you control.
This isn't a trade-off every user needs. If your memory content isn't sensitive, the convenience of cloud solutions may be more valuable. But for users whose memory content involves business decisions, team dynamics, and personal life -- and these are precisely the most valuable use cases for a memory system -- offline capability isn't an optional feature but a prerequisite.
MemPalace's entire technology stack is designed around this prerequisite. ChromaDB instead of Pinecone, SQLite instead of Neo4j, AAAK instead of GPT summaries, stdio instead of HTTP. Every technical choice points in the same direction: your memory should belong entirely to you, whether or not you're connected to the internet.
Chapter 22: Benchmark Methodology
Positioning: This chapter explains why these three benchmarks were chosen, what capability dimension each tests, where their blind spots are, and how anyone can reproduce all results in five minutes. The most honest way to validate a system isn't to show its report card -- it's to publish the exam itself.
Why Three Benchmarks Are Needed
A single benchmark can only answer a single question. A system that scores 96.6% may simply happen to excel at that particular type of question.
This isn't hypothetical. LongMemEval is the most standard test in the AI memory field -- 500 questions spanning 53 conversation sessions, covering six question types. MemPalace scored 96.6% R@5 on it. That score is headline-worthy, but it only answers one question: given a pile of conversation history, can you find which session the answer is hiding in?
It doesn't answer: can you string together clues across multiple sessions? It doesn't answer: when data scale balloons from 53 sessions to thousands of sessions, will performance collapse? It also doesn't answer: across different types of memory -- facts, preferences, changes, reasoning -- is your performance uniform?
So we chose three benchmarks. Not because three numbers look better than one, but because each benchmark tests a completely different cognitive capability. Their intersection covers three core dimensions of AI memory systems: precise retrieval, multi-hop reasoning, and large-scale generalization. Their blind spots -- what each benchmark can't test -- are equally important and are analyzed one by one in this chapter.
LongMemEval: Needle in a Haystack
What It Is
LongMemEval is a standardized memory-evaluation dataset from academic research: 500 questions, each corresponding to a "haystack" -- the history of 53 conversation sessions -- and a "needle" -- the correct answer hidden in one or more of those sessions.
The core capability tested is information localization: given a natural language question, can your system find the session containing the answer among 53 sessions? No need to generate the answer, no need to understand the answer -- just rank the correct session to the top.
Six Question Types
LongMemEval's 500 questions cover six question types, each testing a different retrieval difficulty:
| Type | Count | Description | MemPalace Baseline |
|---|---|---|---|
| knowledge-update | 78 | Facts change over time -- current answer supersedes old answer | 99.0% |
| multi-session | 133 | Answer scattered across multiple sessions | 98.5% |
| temporal-reasoning | 133 | Contains time anchors -- "last month," "two weeks ago" | 96.2% |
| single-session-user | 70 | Answer is in something the user said | 95.7% |
| single-session-preference | 30 | User's indirectly expressed preferences | 93.3% |
| single-session-assistant | 56 | Answer is in the AI assistant's response | 92.9% |
The strongest two categories -- knowledge-update and multi-session -- are precisely MemPalace's design sweet spots. When facts are updated, the original text retains both old and new versions, and the search model naturally matches sessions containing updates. When answers are scattered across multiple sessions, verbatim storage means each session fully preserves its portion of information, and semantic search can hit each one.
The weakest two categories reveal deeper issues. single-session-preference (93.3%) is weak due to the indirectness of preference expression: the user says "I think Postgres is more reliable in concurrent scenarios," and the question asks "What database does the user prefer?" -- the vocabulary doesn't overlap at all, and the embedding model can't see the connection. single-session-assistant (92.9%) is weak due to an indexing gap: by default only user utterances are indexed, but the question asks "What did the AI recommend?" -- the answer simply isn't in the search pool.
Both weaknesses were later fixed. The preference gap was bridged by extracting preference expressions through 16 regex patterns. The assistant gap was solved through two-stage retrieval -- first using user utterances to locate the session, then searching assistant utterances within the target session. After fixes, the score progressed from 96.6% to 99.4%, then to 100%.
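The two-stage idea can be shown with a toy corpus. Everything below is invented and similarity is reduced to word overlap, but the control flow matches the description: user utterances locate the session, and assistant utterances are searched only within it.

```python
# Toy corpora: user vs. assistant utterances keyed by session id.
user_turns = {1: "I need database advice", 2: "plan my trip to Kyoto"}
assistant_turns = {1: "I recommend Postgres for this workload",
                   2: "Visit Fushimi Inari early in the morning"}

def score(query, text):
    """Stand-in for embedding similarity: shared-word count."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t)

def two_stage(query, k=1):
    # Stage 1: rank sessions by the USER side only.
    sessions = sorted(user_turns, reverse=True,
                      key=lambda s: score(query, user_turns[s]))[:k]
    # Stage 2: search ASSISTANT utterances within the located sessions.
    return max(sessions, key=lambda s: score(query, assistant_turns[s]))

two_stage("what database advice did the assistant give")  # -> 1
```

The payoff is that assistant utterances never compete in the global ranking, where they would otherwise be invisible to a user-only index.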
Why It Was Chosen
LongMemEval is currently the most widely cited benchmark in the AI memory field. Supermemory, Mastra, Mem0, Hindsight -- all major competitors have reported scores on this benchmark. This means scores are directly comparable. If your R@5 on LongMemEval is 96.6% and Mastra's is 94.87%, these two numbers use the same ruler.
Its data is public -- hosted on HuggingFace, downloadable by anyone. Its evaluation metrics are standardized -- Recall@K and NDCG@K have clear mathematical definitions. These properties make it an ideal choice for reproducible benchmarking.
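For reference, both metrics have compact definitions under binary relevance. The sketch below is a generic implementation, not MemPalace's harness code:

```python
import math

def recall_at_k(relevant, ranked, k):
    """Fraction of relevant items that appear in the top-k results."""
    return len(set(relevant) & set(ranked[:k])) / len(relevant)

def ndcg_at_k(relevant, ranked, k):
    """Binary-relevance NDCG: DCG of the ranking / DCG of an ideal ranking."""
    rel = set(relevant)
    dcg = sum(1 / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]) if doc in rel)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal

recall_at_k({"s7"}, ["s3", "s7", "s1"], k=2)  # -> 1.0
```

R@5 = 100% on LongMemEval means recall_at_k returned 1.0 for every one of the 500 questions with k=5.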
Blind Spots
LongMemEval has three significant blind spots.
Blind spot one: too small in scale. 53 sessions is a very small search space. A real user's six months of AI usage would produce hundreds of conversation sessions. Ranking first among 53 sessions and ranking first among 500 sessions are completely different tasks. Can LongMemEval's 96.6% be maintained at ten times the scale? This question it cannot answer.
Blind spot two: doesn't test reasoning. LongMemEval only tests retrieval, not understanding. Its metric is "does the correct session appear in the top-K results," not "can the system correctly answer the question using retrieved content." A system that returns all 53 sessions could theoretically score 100% Recall@53 -- but it hasn't "understood" anything.
Blind spot three: doesn't test cross-session reasoning. Although multi-session type questions have answers scattered across multiple sessions, the evaluation criterion is "any one relevant session appearing in top-K" counts as correct. It doesn't test the ability to "connect information from multiple sessions to reach a conclusion."
LoCoMo: Multi-Hop Reasoning
What It Is
LoCoMo (Long Conversational Memory) comes from Snap Research and is a benchmark specifically designed for multi-hop reasoning. 10 long conversations, each containing 19-32 sessions, 400-600 turns of dialogue, producing a total of 1,986 QA pairs.
What does "multi-hop reasoning" mean? Consider this scenario:
- Session 5: Caroline mentions she's studying marine biology
- Session 12: Caroline says she found a related research position
- Session 19: Question -- "What is Caroline's career direction?"
To answer this question, the system needs to connect information from sessions 5 and 12. Retrieving either one alone isn't enough -- you need both to piece together the complete picture. This is what "multi-hop" means: the answer isn't in any single place but distributed across multiple locations, requiring reasoning across multiple information nodes.
Five Question Types
LoCoMo's 1,986 questions fall into five types:
| Type | Description | MemPalace Baseline (R@10) |
|---|---|---|
| single-hop | Answer is in one session | 59.0% |
| temporal | Involves time relationships | 69.2% |
| temporal-inference | Requires cross-session temporal reasoning | 46.0% |
| open-domain | Open-ended questions | 58.1% |
| adversarial | Deliberately confusing questions -- asks about A, but B has said more | 61.9% |
The hardest category is temporal-inference -- requiring temporal causal relationships to be established across multiple sessions. The baseline is only 46.0%. This means over half of cross-temporal reasoning questions cannot be answered correctly by pure semantic retrieval.
The adversarial category reveals an interesting challenge: when two people appear in the same conversation, the embedding model can't distinguish "who said what." If the question asks about Caroline's research direction but Melanie said more in the same session, the embedding model might rank Melanie-dominated sessions higher -- even though Caroline's key information is in another session.
Why It Was Chosen
LoCoMo fills LongMemEval's core blind spot: cross-session reasoning. LongMemEval asks "can you find the correct session," while LoCoMo asks "can you understand relationships between sessions."
It also has an important design feature: the number of sessions per conversation (19-32) is closer to real user data scales. While still not large, it's closer to the real-world scenario of "each project independently accumulating conversation history" than LongMemEval's 53 shared sessions.
Blind Spots
Blind spot one: too few conversations. Only 10 conversations. This means a single conversation's anomalous performance can severely impact the total score. If one conversation's topic distribution happens to be particularly unfavorable for your system, the total score could drop 5-10 percentage points.
Blind spot two: all conversations are fictional. LoCoMo's conversations are artificially written simulations, not real users' AI interaction records. Fictional conversations' language patterns, topic distributions, and information density may systematically differ from real conversations.
Blind spot three: each conversation has only two speakers. Real-world scenarios may involve multiple people in a conversation -- team standups, group discussions, multi-party decisions. LoCoMo only has two-person conversations and doesn't test multi-party information interweaving.
ConvoMem: Large-Scale Coverage
What It Is
ConvoMem comes from Salesforce Research and is currently the largest conversational memory benchmark -- 75,336 QA pairs covering six different memory types. It doesn't test deep reasoning -- it tests breadth and type coverage.
Six Categories
| Category | Description | MemPalace R@K |
|---|---|---|
| assistant_facts_evidence | Facts stated by the AI assistant | 100% |
| user_evidence | Facts stated by the user | 98.0% |
| abstention_evidence | Questions the system should refuse to answer | 91.0% |
| implicit_connection_evidence | Implicit connections requiring inference | 89.3% |
| preference_evidence | User preferences and habits | 86.0% |
| changing_evidence | Facts that change over time | -- |
Scoring 100% on assistant_facts_evidence isn't surprising -- ConvoMem's testing method checks whether retrieval results contain the evidence message, and MemPalace stores every message verbatim (including assistant responses), so the evidence is always present to be hit by search.
preference_evidence is the weakest category (86.0%), for the same reason as LongMemEval's preference category: preferences are often expressed in indirect language, and embedding models struggle to establish connections between questions and expressions.
Why It Was Chosen
ConvoMem fills a dimension missing from both other benchmarks: type coverage. LongMemEval mainly tests fact retrieval, LoCoMo mainly tests reasoning ability, and ConvoMem divides "memory" into six distinct types, testing each separately. This is important because a system that excels at fact retrieval may perform completely differently on preference memory or implicit reasoning.
Its scale (75K+ QA pairs) also provides statistical significance: when you have seventy-five thousand data points, the difference between 86% in one category and 100% in another is real, not noise.
Blind Spots
Blind spot one: short context per QA pair. Many of ConvoMem's test items involve only a few messages of context, unlike LongMemEval which requires searching across 53 sessions. This means it tests "short-range matching" more than "long-range retrieval."
Blind spot two: uneven category weights. Some categories have far more samples than others. The weighted average of 92.9% may mask weaknesses in small categories.
Blind spot three: doesn't test real memory retention. ConvoMem assumes all conversation content has been correctly stored and only tests retrieval capability. It doesn't test real-world problems like "does storage quality degrade over six months of continuous use."
Complementarity of the Three Benchmarks
Looking at all three benchmarks together, they form a triangulation:
| Dimension | LongMemEval | LoCoMo | ConvoMem |
|---|---|---|---|
| Core capability | Precise retrieval | Multi-hop reasoning | Type coverage |
| Data scale | 500 questions | 1,986 QA pairs | 75,336 QA pairs |
| Session scale | 53 shared | 19-32 per conversation | Short context |
| Reasoning depth | Shallow (localization) | Deep (reasoning) | Medium (classification) |
| Competitor comparison | Extensive | Limited | Limited |
| Data source | Academic design | Human simulation | Academic design |
| Reproducibility | Public dataset | Public dataset | Public dataset |
LongMemEval is the yardstick -- everyone uses it, scores are directly comparable, and it's the entry ticket proving a system's basic capability.
LoCoMo is the litmus test -- it tests reasoning capability that LongMemEval cannot, and it's the benchmark most likely to expose system weaknesses. MemPalace's baseline on LoCoMo is only 60.3% -- this score isn't headline material, but it honestly reflects the limitations of pure semantic retrieval on multi-hop reasoning tasks.
ConvoMem is the wide-angle lens -- it doesn't go deep on any single capability but has the broadest coverage, ensuring the system isn't overspecialized on just one question type.
Together, the three cover a complete evaluation space: if a system has precise retrieval on LongMemEval, adequate reasoning capability on LoCoMo, and balanced performance across types on ConvoMem, then you have reasonable confidence it will work in real-world scenarios. If a system scores high on only one of these benchmarks, you should remain skeptical.
What All Three Miss
Triangulation covers many dimensions, but some critical capabilities are entirely outside the testing scope:
Real time spans. All three benchmarks are static datasets. They simulate "existing conversation history," not "memory gradually accumulated over six months." In real use, a memory system faces incrementally growing data -- a few new sessions added each day, continuously expanding indexes -- does retrieval quality degrade over time? This question cannot be answered with static benchmarks.
Write correctness. All three benchmarks assume data has been correctly stored. They don't test the mining stage -- splitting, deduplication, classification, metadata extraction. If MemPalace's convo_miner incorrectly merges two sessions or assigns a conversation to the wrong wing, the benchmark won't catch this error.
End-to-end answer quality. Recall@K measures "is the correct session in the top-K," not "can the system correctly answer the question using retrieved content." A system with perfect retrieval but failed answer generation would still score full marks on all three benchmarks. Complete end-to-end evaluation requires introducing an LLM to generate answers and computing F1 scores -- this requires an API key and means you're no longer testing just the memory system but also the LLM's own capabilities.
Multi-modal content. All three benchmarks are pure text. Code snippets, error stacks, screenshot descriptions, and links that appear in real conversations have different retrieval characteristics from natural language, but none of this falls within any benchmark's test scope.
Runner Code Structure: How to Reproduce
All benchmark runner code is in the benchmarks/ directory, one Python file per benchmark. The design principle is: clone, install, run -- three steps to reproduce, no configuration changes needed.
Directory Structure
```
benchmarks/
  longmemeval_bench.py   -- LongMemEval runner, all modes
  locomo_bench.py        -- LoCoMo runner
  convomem_bench.py      -- ConvoMem runner
  membench_bench.py      -- MemBench runner (extra)
  BENCHMARKS.md          -- Complete results and methodology documentation
  HYBRID_MODE.md         -- Technical details of hybrid retrieval mode
  README.md              -- Quick reproduction guide
  results_*.jsonl        -- Raw results from each run
```
Core Flow of longmemeval_bench.py
This is the most important runner because LongMemEval is the primary battlefield for competitor comparison. Its core loop works like this:
For each of the 500 questions:

- Load the haystack: load all 53 sessions corresponding to that question into a fresh ChromaDB collection. Uses `EphemeralClient` -- in-memory mode, no disk IO, no SQLite handle leaks. The collection is cleared and rebuilt between each question.
- Execute retrieval: query the collection with the question text. Select the retrieval strategy based on the `--mode` parameter -- raw (pure semantic), hybrid (keyword-enhanced), hybrid_v2 (with time enhancement), palace (palace structure navigation), diary (topic summary enhancement).
- Evaluate ranking: compare the returned document list against the ground-truth correct session IDs. Compute Recall@5, Recall@10, NDCG@10.
- Record details: retrieval results for every question -- including every returned document, its distance score, and whether it was a hit -- are all written to a JSONL file. This means you can not only reproduce the total score but also audit every individual question.
The key design decision is using a global singleton of chromadb.EphemeralClient(). Earlier versions used PersistentClient with temporary directories, which would hang around question 388 due to SQLite handle accumulation. Switching to in-memory mode solved this problem while delivering roughly 2x speed improvement -- completing all 500 questions takes about 5 minutes on Apple Silicon.
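The per-question loop can be sketched in a dozen lines. This is a dependency-free simplification: the real runner's retrieval step is ChromaDB embedding search over a fresh `chromadb.EphemeralClient` collection, replaced here by a keyword-overlap scorer so the control flow is runnable on its own. All names are illustrative, not the runner's actual code.

```python
def score(query: str, doc: str) -> float:
    # Stand-in for embedding similarity (the real runner queries ChromaDB).
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)

def recall_hit(question: str, sessions: list[tuple[str, str]],
               answer_ids: set[str], k: int = 5) -> bool:
    # "Load the haystack" + "execute retrieval": rank every session for
    # this one question, then check the top-k against ground truth.
    ranked = sorted(sessions, key=lambda s: score(question, s[1]),
                    reverse=True)
    top_ids = {sid for sid, _ in ranked[:k]}
    return bool(top_ids & answer_ids)
```

Because the haystack is rebuilt per question, state never leaks between questions -- which is exactly the property the `EphemeralClient` switch was protecting.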
Core Flow of locomo_bench.py
LoCoMo's structure is slightly different because its data is organized as "10 independent conversations, each with its own QA pairs":
For each of the 10 conversations:
- Load the conversation: load all sessions (19-32) of that conversation into ChromaDB.
- Ask questions one by one: query with each QA pair of that conversation.
- Evaluate: check whether retrieved sessions contain the ground-truth evidence dialogue.
- Statistics by type: compute recall separately for each of the five question types.
A notable technical detail: LoCoMo's ground-truth annotations are at the dialog level (single turn), but MemPalace's indexing granularity is at the session level (a session contains multiple turns). The runner controls evaluation granularity via the --granularity parameter. Session granularity scores (60.3%) are higher than dialog granularity (48.0%) because a session is a coarser container -- hitting a session containing evidence is easier than hitting the specific turn containing evidence.
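The granularity distinction is easy to state in code (a sketch; the field shapes are illustrative, not the runner's actual schema):

```python
def is_hit(retrieved: list[tuple[str, int]],
           evidence: tuple[str, int],
           granularity: str = "session") -> bool:
    # Session granularity: any turn from the right session counts.
    # Dialog granularity: only the exact evidence turn counts.
    if granularity == "session":
        return any(sid == evidence[0] for sid, _ in retrieved)
    return evidence in retrieved

# Retrieval found session_12, but turn 3 rather than the evidence turn 7:
retrieved = [("session_12", 3), ("session_5", 1)]
```

Hitting the coarser container is strictly easier, which is exactly why session-level recall (60.3%) sits above dialog-level recall (48.0%).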
Core Flow of convomem_bench.py
ConvoMem's distinguishing feature is that its data is distributed across multiple files on HuggingFace, and the runner needs to download before testing:
- Discover files: list available data files for each category via the HuggingFace API.
- Download and cache: download from HuggingFace on first run, cache locally to avoid repeated downloads.
- Sample: control how many test items are sampled per category via the `--limit` parameter. Default is 50.
- Test: for each test item, load the conversation history into ChromaDB, query with the question, and check whether retrieval results contain the evidence message.
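The download-and-cache step boils down to "fetch once, reuse forever." A minimal sketch, with the network call injected as a function so the caching logic is visible and testable offline (the real runner talks to the HuggingFace API; this is not its actual code):

```python
import pathlib

def cached_fetch(url: str, cache_dir: str, fetch) -> bytes:
    # Derive a local filename from the URL; download only on a cache miss.
    path = pathlib.Path(cache_dir) / url.rsplit("/", 1)[-1]
    if not path.exists():
        path.write_bytes(fetch(url))   # first run: hit the network
    return path.read_bytes()           # later runs: local cache only
```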
Quick Reproduction
```bash
# Install
git clone https://github.com/aya-thekeeper/mempal.git
cd mempal && pip install chromadb pyyaml

# LongMemEval (~5 minutes)
mkdir -p /tmp/longmemeval-data
curl -fsSL -o /tmp/longmemeval-data/longmemeval_s_cleaned.json \
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json

# LoCoMo (~2 minutes)
git clone https://github.com/snap-research/locomo.git /tmp/locomo
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json --granularity session

# ConvoMem (~2 minutes)
python benchmarks/convomem_bench.py --category all --limit 50
```
The raw baseline path needs no API key and no GPU. Once the benchmark data and default embedding assets are in place, it can be rerun fully offline. The caveat: the diary and LLM rerank paths do require network access and an API key. No complex configuration files are needed either way.
Auditability of Results
Every run generates a JSONL or JSON result file containing:
- The complete text of each question
- Every retrieved document and its distance score
- The hit/miss determination for each question
- Statistics broken down by question type
This means when someone questions a particular score, you can open the result file, find that specific question, see every document retrieval returned, and verify the evaluation logic one by one. This isn't a black box -- every layer is transparent.
What the Metrics Mean
Recall@K and NDCG@K are standard metrics in the information retrieval field, but for non-specialist readers, their intuitive meanings need explanation.
Recall@K: among the top K results returned, what proportion of correct answers were found? R@5 = 96.6% means: for 483 of the 500 questions, the correct session appeared in the top 5 retrieval results. For the remaining 17 questions, the correct session was not in the top 5.
NDCG@K (Normalized Discounted Cumulative Gain): considers not just whether the correct answer is in the top-K but also its rank position. A correct answer ranked 1st scores higher than one ranked 5th. NDCG@10 = 0.889 means: correct answers not only frequently appear in the top 10 but tend to appear in earlier positions.
In practical use, R@5 is the more important metric, because when your AI assistant calls mempalace_search, it typically looks only at the top 5 results. If the correct answer sits in 6th place, the AI never sees it -- equivalent to not finding it at all.
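Both metrics fit in a few lines of code (written from the standard IR textbook definitions, not taken from the MemPalace repo):

```python
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the relevant sessions that appear in the top-k results.
    top = ranked[:k]
    return sum(1 for r in relevant if r in top) / len(relevant)

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    # DCG discounts each hit by log2(rank + 1); NDCG normalizes by the
    # best achievable ordering, so a hit at rank 1 beats a hit at rank 5.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```

A correct session at rank 1 yields NDCG contribution 1/log2(2) = 1.0; the same session at rank 2 yields only 1/log2(3) ≈ 0.63, which is how NDCG rewards earlier positions.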
The Methodology's Promise
The three benchmarks, runner code, data sources, and evaluation metrics described in this chapter constitute a complete reproducible evaluation framework. Anyone -- whether they want to verify MemPalace's claims, run the same tests on their own system, or understand what these scores actually mean -- can start from here.
But the scores are only half the story. The next chapter places MemPalace's scores in the competitive landscape, making head-to-head comparisons with Supermemory, Mastra, Mem0, Zep, and other systems. We'll show where it wins, where it loses, and why some "losses" are more meaningful than they appear on the surface.
Chapter 23: An Honest Comparison with Competitors
Positioning: This chapter places MemPalace within the competitive landscape, comparing system by system, dimension by dimension. Where it wins, the results are presented factually. Where it loses, the reasons are analyzed with equal honesty. No marketing language, no belittling of competitors, no hiding of weaknesses.
The Report Card First
Below is a direct comparison of LongMemEval R@5, with all data sourced from each system's public reports or reproducible benchmark runs:
| System | LongMemEval R@5 | API Dependency | Cost |
|---|---|---|---|
| MemPalace (hybrid v4 + rerank) | 100% | Optional (Haiku) | Free + ~$0.001/query |
| Supermemory ASMR | ~99% | Yes | Undisclosed |
| MemPalace (raw) | 96.6% | None | Free |
| Mastra | 94.87% | Yes (GPT-5-mini) | API cost |
| Mem0 | ~85% | Yes | $19-249/month |
| Zep | ~85% | Yes | From $25/month |
This table is real. But if you look only at this table and conclude "MemPalace crushes everything," you're missing a lot of important context.
Comparison Across Four Dimensions
A one-dimensional score ranking is dangerous. It hides fundamental architectural differences between systems, forcing products with different design philosophies onto the same ruler. A more honest comparison requires at least four dimensions.
Dimension One: Accuracy
LongMemEval is the most standard comparison battlefield, and the table above already shows the results. But looking only at LongMemEval is far from enough.
ConvoMem (75K+ QA pairs) comparison:
| System | ConvoMem Score | Notes |
|---|---|---|
| MemPalace | 92.9% | Verbatim storage + semantic search |
| Gemini (long context) | 70-82% | Puts entire history into context window |
| Block extraction | 57-71% | LLM-processed block extraction |
| Mem0 (RAG) | 30-45% | LLM-extracted memories |
MemPalace exceeds Mem0 by more than double on ConvoMem. This isn't a marginal advantage -- it's two-fold. The reason deserves deep analysis: Mem0 uses an LLM to decide "what's worth remembering," then saves only the extracted facts. When the LLM extracts the wrong thing or misses a critical detail, that portion of memory is permanently lost. MemPalace's verbatim storage does no filtering -- it doesn't judge what's important and what isn't -- so the "incorrect extraction" failure mode doesn't exist.
But now let's look at where MemPalace performs poorly.
LoCoMo (1,986 multi-hop QA pairs): an honest analysis of 60.3%.
MemPalace's baseline score on LoCoMo is 60.3% R@10 (session granularity, no rerank). This score isn't good. It means that in four out of ten multi-hop reasoning questions, MemPalace couldn't even rank the correct session in the top ten.
Why?
LoCoMo tests a capability that MemPalace's fundamental architecture isn't built for: cross-session information chaining. Consider a typical LoCoMo question: "What field did Caroline find work in?" The answer requires connecting session 5 (she mentioned interest in marine biology) and session 12 (she said she received a research position offer). But MemPalace's semantic search scores each session independently -- it doesn't know sessions 5 and 12 have a causal relationship. The key words "field" and "work" in the question have weak semantic associations with two different sessions, but not enough to rank either into the top-10.
Breaking down performance by category more specifically:
| Category | R@10 (Baseline) | Notes |
|---|---|---|
| temporal | 69.2% | Best -- temporal relationships are the most direct retrieval signal |
| adversarial | 61.9% | Severe speaker confusion |
| single-hop | 59.0% | Even single-hop is only 60% -- search space isn't precise enough |
| open-domain | 58.1% | Vocabulary matching is harder for open-ended questions |
| temporal-inference | 46.0% | Worst -- temporal questions requiring reasoning are near random level |
The 46.0% on temporal-inference approaches random guessing. This is MemPalace's most honest weakness: when answers require reasoning across multiple time points, pure vector retrieval essentially doesn't work.
However, it should be noted that competitor comparison data for LoCoMo is limited. Mem0, Zep, and Supermemory have not publicly reported LoCoMo scores. The known reference point is the Memori system's 81.95% (R@10), and MemPalace's hybrid v5 mode (88.9% R@10) exceeds it. But the 60.3% baseline is indeed not competitive.
There's also a structural issue that needs transparent disclosure: each LoCoMo conversation has only 19-32 sessions, and when using top-k=50 for retrieval, the candidate pool already includes all sessions -- at this point, Sonnet rerank is essentially doing reading comprehension, not retrieval. Therefore, the 100% score obtained with top-k=50 + Sonnet rerank has structural guarantees and should not be conflated with honest retrieval scores at top-k=10. The honest LoCoMo score is the result at top-10.
Dimension Two: Cost
This is one of MemPalace's core advantages and also the easiest dimension to quantify.
| System | Monthly Cost | Annual Cost | Cost Composition |
|---|---|---|---|
| MemPalace (raw) | $0 | $0 | No API calls |
| MemPalace (hybrid + rerank) | ~$0.30 | ~$3.60 | ~300 queries x $0.001/query |
| Mastra | Variable | Variable | GPT-5-mini API cost |
| Mem0 | $19-249 | $228-2,988 | Subscription |
| Zep | $25+ | $300+ | Subscription |
| Letta (MemGPT) | $20-200 | $240-2,400 | Subscription |
MemPalace's raw mode costs zero in day-to-day operation. No API calls, no cloud services, no subscription fees. ChromaDB runs locally, and the current baseline uses ChromaDB's default local embedding path. One caveat for precision: the repository does not vendor the default embedding asset, so the most accurate claim is that once the initial asset-preparation step is complete, day-to-day raw queries cost zero.
Even with the optional Haiku rerank, each query costs approximately $0.001 -- one dollar per thousand queries. Assuming an active user makes 10 memory searches per day, a month of 300 queries costs $0.30.
This cost differential isn't at the percentage level. Mem0's entry price ($19/month) is infinitely more than MemPalace's raw mode cost -- because the denominator is zero. Even compared to MemPalace's hybrid mode, Mem0's minimum annual cost ($228) is still 63 times that of MemPalace.
But to be fair, Mem0 and Zep's pricing includes things MemPalace doesn't provide: hosted infrastructure, management interfaces, team collaboration features, and SLA guarantees. For enterprise users, $25/month Zep may actually be cheaper than "free but self-managed" MemPalace -- because operational time has cost too.
Dimension Three: Privacy
| System | Data Location | API Communication | Privacy Model |
|---|---|---|---|
| MemPalace (raw) | Fully local | None | Data never leaves your machine |
| MemPalace (hybrid) | Primarily local | Only session fragments sent during rerank | Optional minimal data egress |
| Mem0 | Cloud | Full API | Vendor holds data |
| Zep | Cloud | Full API | SOC 2, HIPAA compliant |
| Supermemory | Cloud | Full API | Vendor holds data |
| Mastra | Depends on deployment | GPT API | OpenAI holds query data |
MemPalace's raw mode is one of the very few mainstream AI memory systems that comes close to zero data egress in day-to-day use. Not "we encrypt the data," not "we're SOC 2 compliant," but that the main raw search loop never needs to send queries or memories to a third-party API. ChromaDB runs locally, embedding computation is local, and search is local. One caveat, noted in other chapters of this book: the default embedding assets still require an initial preparation step. After that, your conversation records -- technical decisions, internal discussions, code snippets, even personal preferences -- remain on your disk.
The hybrid mode introduces a privacy trade-off: when LLM rerank is enabled, the first 500 characters of top-K candidate sessions are sent to Anthropic's API for reranking. This means a small amount of conversation content leaves your machine per query. But this is optional and controllable: you can choose not to use rerank and accept 96.6% accuracy, or use rerank to pursue higher accuracy.
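The shape of that trade-off is simple to pin down in code (a sketch; the field names are illustrative, though the 500-character limit follows the description above):

```python
def rerank_payload(candidates: list[dict], limit: int = 500) -> list[dict]:
    # Only a bounded snippet of each top-K candidate leaves the machine;
    # the full sessions and the rest of the archive stay local.
    return [{"id": c["id"], "snippet": c["text"][:limit]}
            for c in candidates]
```

Bounding the payload this way makes per-query data egress auditable: the worst case is K snippets of 500 characters, never the archive.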
Zep deserves special mention: it's the most serious commercial product in this space when it comes to privacy compliance. SOC 2 certification and HIPAA compliance mean it has undergone third-party auditing with legally binding data processing constraints. For users in regulated industries such as healthcare and finance, Zep's compliance may be more practical than MemPalace's "fully local" -- because "fully local" means compliance responsibility falls on the user.
Dimension Four: API Dependency
| System | Available without API | APIs Required | Offline Operation |
|---|---|---|---|
| MemPalace (raw) | Fully available | None | Offline after cold-start preparation |
| MemPalace (hybrid) | 96.6% available, 100% requires API | Anthropic (optional) | Partially offline |
| Mastra | Unavailable | OpenAI (GPT-5-mini) | Not supported |
| Mem0 | Unavailable | Own API + LLM API | Not supported |
| Zep | Unavailable | Own API + Graph DB | Not supported |
| Supermemory | Unavailable | Own LLM API | Not supported |
The comparison on this dimension is very clear: MemPalace is the only system that delivers competitive scores without any API key at all. 96.6% R@5 -- zero API calls -- already exceeds Mastra (94.87%, requires GPT-5-mini), Mem0 (~85%, requires paid subscription), and Zep (~85%, requires paid subscription).
This isn't a trivial property. API dependency means:
- Availability risk: when the API provider goes down, your memory system doesn't work at all. Between 2024-2025, major LLM APIs accumulated enough downtime to make this a real concern.
- Uncontrollable costs: API pricing is determined unilaterally by the provider. Your memory system's operating cost depends on a variable you can't control.
- Geographic restrictions: some regions can't access certain API providers. A memory system dependent on the OpenAI API is unusable in certain network environments.
- Data sovereignty: API calls mean data leaves the country. For compliance requirements of certain organizations and regions, this is a hard constraint.
Competitor-by-Competitor Analysis
Supermemory ASMR (~99% R@5)
Supermemory is the closest competitor to MemPalace. Its ASMR (Agentic Search with Memory Retrieval) architecture reports approximately 99% on LongMemEval -- but this is the experimental version's number; its production version is around 85%.
What Supermemory gets right: ASMR uses an LLM to run multiple search rounds -- when first-round retrieval results aren't satisfactory, the LLM reformulates the query and searches again. This agentic approach is particularly effective on semantically ambiguous queries: when the first search misses, the LLM can understand why it failed and adjust strategy.
MemPalace's comparative advantage: No API dependency. Each Supermemory search may trigger multiple LLM calls, with higher cost and latency. MemPalace's raw mode has zero cost and sub-second latency -- a notable gap in high-frequency search scenarios.
Fair judgment: If you don't care about cost and latency, Supermemory's agentic approach may be more flexible than MemPalace on certain complex queries. But if you care about privacy and offline capability, Supermemory isn't an option.
Mastra (94.87% R@5)
Mastra uses GPT-5-mini as an "observer" -- the LLM extracts observations in real-time as conversations happen, then stores these observations rather than the raw conversations.
What Mastra gets right: The LLM at the extraction stage can understand conversation structure and make implicit information explicit. If the user says "that Postgres issue last time gave me a headache all day," Mastra's LLM can extract the explicit fact "user encountered difficulty with Postgres."
Mastra's problem: Once extraction is complete, the original conversation is discarded. If GPT-5-mini misses a detail during extraction -- say it interprets "headache all day" as an emotional expression rather than recording it as time investment -- that information is permanently lost. MemPalace preserves original text, so this failure mode doesn't exist.
What the score gap means: MemPalace raw (96.6%) exceeds Mastra (94.87%) by about 1.7 percentage points -- roughly 9 questions out of 500. That gap is real but not overwhelming. Considering that Mastra requires API costs while MemPalace doesn't, the 1.7pp advantage carries more weight.
Mem0 (~85% R@5)
Mem0 is one of the best-known products in this space -- brand recognition far exceeds MemPalace. It uses LLM extraction of "core memories" -- distilling conversations into brief factual snippets.
What Mem0 gets right: Its user experience is excellent. Integration is simple, the management interface is intuitive, and memory visualization is better than any competitor. For teams that don't want to self-manage, Mem0's hosted service eliminates all infrastructure concerns.
Mem0's fundamental problem: On the ConvoMem benchmark, Mem0 scored only 30-45% -- less than half of MemPalace's 92.9%. The reason is systemic, not accidental: the LLM extraction approach inevitably loses information. When the LLM compresses a 45-minute architecture discussion into "user prefers Postgres," it loses why Postgres was preferred, under what scenarios, what alternatives were compared, and what trade-offs were weighed. When subsequent questions involve this discarded context, the system can't find the answer.
Fair acknowledgment: Mem0's $19-249/month pricing includes commercial support, SLAs, and team collaboration features that MemPalace doesn't have. For an enterprise team that needs "out-of-the-box with someone responsible," Mem0's total cost of ownership may be lower than MemPalace's -- because MemPalace's "free" doesn't include operational labor costs.
Zep (~85% R@5)
Zep uses a graph database (similar to Neo4j) to store entity relationships. Its Graphiti system builds a time-aware knowledge graph -- relationships between entities have effective and expiration dates.
What Zep gets right: The knowledge graph approach has a natural advantage on entity relationship queries. "What project is Kai working on now?" -- a graph database can directly traverse edges to answer this without searching through document collections. The time-validity design is also elegant -- when facts change, old relationships are marked as expired rather than deleted.
MemPalace's comparison: MemPalace's knowledge graph (knowledge_graph.py) provides similar capabilities -- time validity, entity queries, timeline -- but uses SQLite underneath instead of Neo4j. This means zero additional dependencies and zero operational cost, but also means large-scale graph traversal may be slower than a specialized graph database.
Fair judgment: The ~85% score on LongMemEval may not fully represent Zep's capabilities. Zep's design goals go beyond retrieval -- its graph capabilities, entity relationship management, and enterprise compliance (SOC 2, HIPAA) are things MemPalace doesn't formally offer. If your need is "build a compliant enterprise-grade memory system," Zep's ~85% retrieval score may be an acceptable trade-off.
Hindsight (91.4% R@5)
Hindsight is a newer system, validated by Virginia Tech, using Gemini-3 and time-aware vector retrieval.
What Hindsight gets right: Its time-awareness approach is similar to the time enhancement in MemPalace's hybrid v2 -- adding temporal proximity as a signal on top of vector similarity. This direction is correct because many memory queries are fundamentally time-anchored.
Score positioning: 91.4% falls between Mem0/Zep (~85%) and Mastra (94.87%). It requires an LLM API (Gemini-3) but hasn't yet reached the level of API-free MemPalace raw (96.6%).
Where MemPalace Wins
From the four-dimensional comparison, MemPalace's competitive advantages can be clearly distilled:
On the accuracy dimension, MemPalace's raw mode (96.6%) without API already exceeds all API-requiring competitors except Supermemory ASMR's experimental version. The hybrid v4 + rerank score of 100% is the highest published LongMemEval score to date.
On the cost dimension, MemPalace is the only zero-cost option. All other systems require at least API call costs or subscription fees.
On the privacy dimension, MemPalace's raw mode is the closest thing here to a no-routine-data-egress solution, once local assets are prepared.
On the API dependency dimension, MemPalace is the only system that remains competitive without any API.
These advantages share a common technical root: MemPalace chose "preserve everything, use structure to organize" instead of "use AI to extract and compress." The direct consequence of this design decision is that no LLM participates in the indexing process, therefore no API is needed, no cost is incurred, and no data leaves the machine.
Where MemPalace Loses
Equally honestly, MemPalace is weaker than competitors in the following scenarios:
Multi-hop reasoning. The 60.3% baseline on LoCoMo shows that when answers require cross-session chaining, pure semantic retrieval isn't enough. Systems that use LLM-assisted memory extraction (Mem0, Mastra) can establish cross-session associations at the extraction stage -- "user mentioned interest in marine biology in session 5, found a related job in session 12" can be extracted as a coherent memory. MemPalace stores these two sessions separately and can only score them independently during search.
The hybrid v5 mode raised LoCoMo scores to 88.9% (R@10), primarily through keyword enhancement and person name extraction. Wings v3's speaker attribution design pushed the adversarial category from 34.0% to 92.8%. But the temporal-inference category -- requiring genuine temporal reasoning -- remains the weakest link.
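The effect of person-name extraction and speaker attribution can be shown with a toy scoring rule. Everything below -- the name heuristic, the boost values -- is invented for illustration; Wings v3's actual mechanism is described in the text only at the level of "attach the speaker to each stored turn and boost matches":

```python
import re

def extract_names(text):
    """Toy person-name heuristic: capitalized tokens that are not
    sentence-initial. A real system would use proper NER."""
    names = set()
    for sentence in re.split(r"[.!?]\s*", text):
        tokens = sentence.split()
        for tok in tokens[1:]:  # skip the sentence-initial token
            if tok[:1].isupper() and tok[1:].islower():
                names.add(tok.strip(",;:"))
    return names

def score(query, memory, base_sim):
    """Boost a memory whose speaker or mentioned names appear in the query.
    The 0.3 / 0.1 boosts are placeholder values, not MemPalace's."""
    boost = 0.0
    q_names = extract_names(query)
    if memory["speaker"] in q_names:
        boost += 0.3
    if q_names & extract_names(memory["text"]):
        boost += 0.1
    return base_sim + boost

memories = [
    {"speaker": "Rachel", "text": "I started a pottery class last week."},
    {"speaker": "Tom", "text": "I started a cooking class last week."},
]
query = "What class did Rachel start?"
ranked = sorted(memories, key=lambda m: score(query, m, base_sim=0.5), reverse=True)
assert ranked[0]["speaker"] == "Rachel"
```

Without the speaker field, both memories are near-identical at the embedding level -- which is precisely why adversarial questions collapsed to 34.0% before attribution was added.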
Enterprise features. MemPalace has no management console, no team collaboration, no audit logs, no SLA. Zep and Mem0 as commercial products are far ahead on these dimensions. For enterprise customers who need "something IT can manage," MemPalace is currently not a viable option.
Integration ecosystem. Mem0 and Zep have rich SDKs (Python, JavaScript, Go), integrations with major frameworks, and detailed API documentation. MemPalace's integration methods are primarily MCP (Model Context Protocol) and CLI -- very convenient for developers already using Claude, but a higher barrier for users in other ecosystems.
Noisy data handling. In the MemBench benchmark's noisy category -- deliberately mixing distractor information into questions -- MemPalace scored only 43.4%. This exposes a structural weakness of the verbatim storage approach: when noise is indistinguishable from signal at the embedding level, retrieval quality degrades severely. Systems using LLM extraction can filter noise at the extraction stage, but MemPalace preserves everything -- including noise.
The Honesty Behind 100%
MemPalace scored 100% on LongMemEval -- 500/500, full marks on all six question types. This is a fact. But this fact needs some context.
The improvement path from 96.6% to 99.4% (hybrid v1 to v3) was based on category-level failure mode analysis -- each improvement targeted a class of questions, not specific individual questions. These improvements are generalizable.
But the final 3 questions from 99.4% to 100% were fixed by examining the specific failure reasons of those three particular questions:
- One question required exact phrase matching because it contained the quoted phrase 'sexual compulsions'
- One question required proper name boosting because it involved the specific name Rachel
- One question required nostalgic pattern preference extraction because it involved high school memories
These three fixes are teaching to the test. They may generalize to similar query patterns, or they may be effective only on these three specific questions. In rigorous academic review, this is a methodological issue that needs annotation.
The team's approach to this was: establish a 50/450 dev/held-out split. On the 450 held-out questions never used for tuning, hybrid v4 scored 98.4% R@5, 99.8% R@10. These are the honest publishable numbers.
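The protocol is simple to reproduce. A minimal sketch of the split mechanics -- the question IDs are placeholders, and the actual evaluation harness is not shown:

```python
import random

def dev_holdout_split(question_ids, dev_size=50, seed=42):
    """Split benchmark questions into a dev set (used for tuning)
    and a held-out set (touched only once, for the final report)."""
    rng = random.Random(seed)
    ids = list(question_ids)
    rng.shuffle(ids)
    return ids[:dev_size], ids[dev_size:]

dev, held_out = dev_holdout_split(range(500), dev_size=50)
assert len(dev) == 50 and len(held_out) == 450
assert not set(dev) & set(held_out)  # no leakage between the two sets

# The tuning loop looks only at `dev`; the number you publish is
# recall measured once on `held_out`.
```

The discipline, not the code, is the hard part: every failure-mode analysis that informed hybrid v4 has to stay on the dev side of the line.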
Three numbers tell three different stories:
- 96.6% -- the baseline capability with zero API, zero tuning, zero human intervention. The most conservative and most reliable claim.
- 98.4% -- the honest score on the held-out set, including generalizable improvements but excluding test-set tuning.
- 100% -- the full score on the complete test set, including three fixes targeting specific questions. Brilliant but requires annotation.
What 60.3% Means
If 100% is good news that needs context, 60.3% is bad news that needs analysis.
LoCoMo's 60.3% R@10 baseline means MemPalace's performance on multi-hop reasoning tasks is just "passing." Among five categories, temporal-inference scored only 46.0% -- near random level.
But "passing" doesn't equal "failing." There are three layers of analysis here.
First layer: this score is without API. All systems that use LLM assistance perform better on LoCoMo because multi-hop reasoning inherently requires understanding -- not just retrieval. MemPalace's 60.3% is the result of using a pure retrieval system on a reasoning task. Under the same conditions (no LLM), MemPalace's hybrid v5 already reaches 88.9%, exceeding Memori's 81.95%.
Second layer: optimization space has been validated. Wings v3's speaker attribution design boosted adversarial from 34.0% to 92.8% -- proving that structural improvements can dramatically boost LoCoMo scores. The bge-large embedding model (replacing the default all-MiniLM) lifted single-hop by 10.6pp. Haiku rerank pushed bge-large's score from 92.4% further to 96.3%. The direction of these improvements is clear.
Third layer: LoCoMo's structural limitations. Each conversation has only 19-32 sessions, and when top-k=50, all sessions are in the candidate pool, making rerank equivalent to reading comprehension. This means LoCoMo's 100% rerank score and LongMemEval's 100% rerank score can't be judged by the same standard. The former has structural guarantees; the latter is a genuine retrieval achievement.
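The structural guarantee is plain arithmetic: whenever top-k meets or exceeds the number of sessions, the candidate pool is the entire conversation, so the "retrieval" step cannot miss anything before rerank. A one-function sketch (the 200-session figure below is an illustrative contrast, not a benchmark statistic):

```python
def pool_covers_everything(num_sessions, top_k):
    """True when the candidate pool necessarily contains every session,
    turning rerank into reading comprehension rather than retrieval."""
    return top_k >= num_sessions

# LoCoMo: 19-32 sessions per conversation, top-k = 50 -> always covered.
assert all(pool_covers_everything(n, 50) for n in range(19, 33))

# A longer history (hypothetical 200 sessions) forces real selection.
assert not pool_covers_everything(200, 50)
```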
The Fundamental Divergence in Design Philosophy
Behind all these comparisons are two fundamentally different design philosophies.
Route A: "Let AI decide what's worth remembering." This is the route of Mem0, Mastra, and Supermemory. The LLM reads the conversation, extracts key information, and discards the rest. The advantage is compact storage and small search space. The disadvantage is irreversible loss of original context -- once extraction goes wrong, there's no going back.
Route B: "Preserve everything, use structure to organize." This is MemPalace's route. No information filtering at the raw-storage layer, verbatim preservation of original conversations, and structure on top of that raw base. The advantage is preserving the source material and avoiding mandatory API dependency in the raw path. The disadvantage is a larger search space and harder multi-hop reasoning, while some higher-level compressed representations still remain heuristic rather than perfectly lossless.
LongMemEval results show: Route B's retrieval precision is not lower than Route A's, and is in fact higher. 96.6% vs 85-95% isn't a fluke -- it reflects a fundamental truth: when you've preserved all original text, the answer is always there waiting to be found. When you let an LLM extract memories, the answer may have already been "extracted" away.
LoCoMo results show: Route B is indeed weaker than Route A's potential on reasoning tasks. Verbatim storage preserves information but doesn't establish connections between information. This is an open engineering problem -- tunnels (cross-wing connections) in the palace structure and temporal validity in the knowledge graph are attempting to address it.
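What "temporal validity in the knowledge graph" means can be sketched with plain SQLite: each edge carries a valid_from / valid_to interval, and queries filter by an as-of date. This is an illustrative schema, not MemPalace's actual one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE edges (
        subject TEXT, predicate TEXT, object TEXT,
        valid_from TEXT,   -- ISO date the fact became true
        valid_to   TEXT    -- NULL while the fact is still current
    )
""")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?, ?, ?)",
    [
        ("user", "works_at", "Acme Corp", "2024-01-01", "2025-06-30"),
        ("user", "works_at", "Beta Labs", "2025-07-01", None),
    ],
)

def works_at(as_of):
    """Return the employer whose validity interval covers the given date.
    ISO date strings compare correctly as plain text."""
    row = conn.execute(
        "SELECT object FROM edges WHERE subject='user' AND predicate='works_at' "
        "AND valid_from <= ? AND (valid_to IS NULL OR valid_to >= ?)",
        (as_of, as_of),
    ).fetchone()
    return row[0] if row else None

assert works_at("2024-06-01") == "Acme Corp"
assert works_at("2025-08-01") == "Beta Labs"
```

A question like "where did the user work last spring?" then becomes a filtered lookup instead of a free-text reasoning problem -- which is precisely the gap temporal-inference questions expose in pure verbatim retrieval.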
Ultimately, this isn't a question of "which route is better" but "what are you optimizing for." If your primary constraints are privacy and cost, Route B is the only choice. If your primary constraints are reasoning depth and enterprise compliance, Route A's commercial products may be more suitable. MemPalace chose Route B and has walked it to the farthest known point.
What Is Not Claimed
Finally, there are things MemPalace explicitly does not claim:
Does not claim "best AI memory system." That depends on what standard you use to define "best." On LongMemEval, yes. On LoCoMo's baseline, no. On enterprise features, far from it.
Does not claim competitors are bad. Mem0 does user experience better than MemPalace. Zep does compliance better than MemPalace. Supermemory's agentic search is more flexible than MemPalace in certain scenarios. Each system has made reasonable engineering choices within its own design constraints.
Does not claim 100% is unconditional. 100% has context. The 98.4% held-out score is the more honest number. The 96.6% API-free baseline is the most conservative claim. All three numbers are true, but they answer different questions.
Does not claim free equals zero cost. MemPalace's software is free, but running it requires your own machine, your own time, and your own operational ability. For enterprises with IT teams, $25/month Zep may have lower actual cost than "free but self-managed" MemPalace.
The most honest way to validate a system isn't to show only where you win, but to simultaneously show where you lose and explain why. This chapter attempts to do exactly that. The next part shifts from validation to the future -- MemPalace's roadmap, known unresolved issues, and the project's open directions.
Chapter 24: Local-First Is Not a Compromise
Positioning: MemPalace chose a local-first architecture not because of limited budget, nor because of insufficient technical capability to build a cloud service. Local-first is a deliberate architectural constraint, rooted in an understanding of the nature of memory data, the founder's values, and a belief in open source as infrastructure.
The Most Private Data
Your password was leaked. That's bad, but you can change your password. Your credit card number was stolen. That's also bad, but you can cancel and reissue. Your ID number was exposed. That's very bad, but at least your ID number won't tell the thief how you think.
Now consider a different scenario: your six months of AI conversation records are leaked.
What's in those conversations?
There's your hesitation at 3 AM discussing with the AI whether you should let go of an underperforming team member. There's the business judgment logic exposed while analyzing a competitor's product. There's the system architecture details exposed while debugging a security vulnerability. There's what you said when ruling out a technical option -- "this framework's community is too small, the maintainer looks like they're about to give up" -- a statement that could offend an entire open-source community if made public. There's the bottom line exposed while discussing salary negotiation strategy with the AI. There's what you said during an architecture decision: "I actually don't understand this domain well, but I can't let the team know."
These aren't data. These are your thought processes.
Traditional data breaches affect your identity and assets. A leak of AI conversation records affects the exposure of your judgment, decision patterns, and cognitive weaknesses. Passwords can be changed, credit cards can be reissued, but you can't "replace" your way of thinking. Once someone knows how you make decisions -- under what conditions you'll compromise, on what topics you lack confidence, which choices you tend toward under pressure -- that information can be exploited permanently.
This is why AI memory data is fundamentally different from every other type of personal data.
Consider a scenario where a team uses MemPalace. Five people on the team have deep technical conversations with AI every day. After six months, the memories accumulated in MemPalace include not just technical decisions but a complete portrait of team dynamics: whose proposals are frequently adopted, whose opinions are frequently overruled, who has technical disagreements with whom, which decisions were made through compromise. This is a cognitive X-ray of the organization.
Handing such data to a third-party server for hosting is like putting the organization's cognitive X-ray in someone else's safe -- even if they promise not to open and look. The issue isn't whether the other party is trustworthy, but that this trust relationship shouldn't need to exist in the first place.
A Trust Problem That Doesn't Need to Exist
Existing AI memory products -- Mem0, Zep, Letta -- use the standard SaaS model: your memory data is uploaded to their servers, and they provide storage, retrieval, and management. They guarantee security through SOC 2 compliance, HIPAA certification, and encrypted transmission.
These security measures are real and valuable. But they solve the wrong problem.
SOC 2 certification tells you the company has standardized security processes. It can't guarantee a data breach won't happen -- countless companies that passed SOC 2 audits have historically experienced data breaches. HIPAA compliance tells you the company knows how to handle sensitive health data. It can't guarantee your data remains safe when the company is acquired, goes bankrupt, or gets subpoenaed. End-to-end encryption tells you data is secure during transmission. It can't guarantee data won't be exposed during server-side decryption for processing -- and the server side must decrypt to perform semantic search.
The more fundamental question is: why does this trust relationship need to exist?
MemPalace's answer is: it doesn't.
When all data is on your machine -- ChromaDB on your hard drive, SQLite in your filesystem, AAAK-compressed text in your local directory -- there's no third party to trust. No SOC 2 needed, because there's no third-party server. No HIPAA needed, because data never leaves your device. No encrypted transmission needed, because there's no transmission.
This isn't a comparison of technical merits. Cloud solutions are technically perfectly viable -- their retrieval precision, storage efficiency, and user experience can all be excellent. This is a choice about trust architecture: manage trust risk through compliance certifications, or eliminate trust risk by eliminating the need for trust.
MemPalace chose the latter.
Values Foundation
This choice isn't accidental. To understand why it's MemPalace's default posture rather than an optional configuration, you need to trace back to Ben Sigman's career trajectory.
Before creating MemPalace, Ben spent considerable time in decentralized finance. Bitcoin Libre's work was decentralized lending -- a market where users can lend and borrow without trusting any centralized institution. This wasn't a technical experiment; it was a running product built on a clear value proposition: financial transactions shouldn't depend on trust in intermediaries.
The reasoning chain behind this value proposition goes like this: when you put funds in a bank, you trust the bank won't fail, won't freeze your account, won't be forced by the government to hand over your asset records. Most of the time this trust is reasonable. But the gap between "reasonable most of the time" and "always reasonable" is the reason decentralized finance exists. Decentralized lending doesn't say banks are untrustworthy -- it says in a system that can operate without requiring trust, why introduce trust as a variable?
Translate this reasoning chain to the memory system domain: when you put AI memory data on a SaaS server, you trust the service provider won't leak your data, won't change their privacy policy after being acquired, won't let your memories disappear along with their servers upon bankruptcy. Most of the time this trust is reasonable. But in a system that can operate without requiring trust, why introduce trust as a variable?
For someone who spent years in the decentralized lending market, this reasoning is natural, almost automatic. Local-first isn't a technical preference, isn't an engineering trade-off about latency or bandwidth -- it's a philosophical stance about trust architecture. Your money shouldn't depend on trusting intermediaries. Your memory even less so.
This is why MemPalace's local-first isn't a feature ("we also support local deployment!") but an architectural constraint: the core raw path is designed so day-to-day memory storage and retrieval do not require your data to leave your machine. Features can be turned off; architectural constraints cannot.
The Cost of Not Depending on Any Service
Local-first has costs. Acknowledging this is more honest than pretending it's a free lunch.
Cost one: no cross-device sync. Your MemPalace is on your laptop. If you want to access the same memories on your desktop, you need to solve the sync problem yourself -- via Git, rsync, shared filesystems, or whatever file sync solution you trust. SaaS products naturally solve this because data is in the cloud, accessible from any device.
Cost two: no collaboration features. A five-person team wanting to share a single MemPalace needs to build shared infrastructure themselves. SaaS products naturally support multi-user collaboration because data is on shared servers.
Cost three: no managed operations. ChromaDB crashes, you fix it yourself. SQLite file gets corrupted, you restore from backup yourself. SaaS products have operations teams monitoring 24/7 -- you don't have to worry.
Cost four: no pushed incremental improvements. SaaS products can continuously optimize retrieval algorithms, compression strategies, and index structures in the background -- users upgrade seamlessly. Local application upgrades require users to update proactively.
These are real costs. MemPalace doesn't pretend they don't exist. Its position is: these costs are worth bearing because the alternative's cost is higher -- the alternative's cost is introducing a third party you can't fully control to host your most private data.
But looking deeper, most of these costs are engineerable. Cross-device sync can be implemented through encrypted P2P synchronization (such as Syncthing) without a centralized server. Team collaboration can be implemented through shared filesystems or Git repositories. Data backup can use standard file backup tools. These solutions are rougher than SaaS out-of-the-box experiences, but they maintain the core constraint of local-first: data always remains on devices you control.
There's also an often-overlooked fact: for individual developers and small teams -- MemPalace's current primary user base -- cross-device sync and multi-person collaboration aren't core needs. One person using MemPalace on one machine is the most common and most natural usage pattern. In this pattern, the cost of local-first approaches zero while the benefit -- data entirely in your own hands -- is maximized.
The Significance of the MIT License
MemPalace's open-source nature isn't a marketing strategy. It's the logical extension of the local-first architecture.
Consider a hypothetical: MemPalace is local-first, but it's closed-source. Your data is on your machine, but the code processing the data is a black box. You can't audit whether it sends telemetry data to some server in the background. You can't confirm whether information leaks during the AAAK compression process. You can't verify whether ChromaDB's query process triggers network requests.
Is such a system local-first? From a data storage perspective, yes. From a trust perspective, no -- because you still need to trust a codebase you can't audit.
The MIT license solves this problem. Anyone can read every line of MemPalace's code and audit the actual data path. The stricter version of that claim should be: the core raw storage / search / wake-up loop is local-first, while optional networked capabilities such as benchmark rerank and Wikipedia lookup can be identified and evaluated separately. Code auditability means the "local-first" promise is verifiable, not merely a claim requiring trust.
The MIT license also solves another more long-term problem: survivability.
SaaS products have an inherent survivability risk: the company goes bankrupt, the service shuts down, and your data -- even if the company promises time to export before shutdown -- faces migration costs and format incompatibility issues. More critically, when an AI memory SaaS shuts down, you lose not just data but the logic for processing data -- how your memories were organized, how they were retrieved, how the compression algorithm works -- this knowledge disappears along with the company.
MemPalace doesn't have this risk. The code is in your hands, the data is in your hands. Even if MemPalace's GitHub repository disappears tomorrow, your forked copy is still a complete, runnable system. The MIT license ensures anyone has the right to fork, modify, and distribute. The survivability of your memory doesn't depend on the continued operation of any company, team, or individual.
This isn't a theoretical advantage. In the early stage of the AI memory market -- 2025-2026 -- product survival rates are uncertain. AI memory startups have already shut down during this period. Users of closed-source SaaS memory products face the risk of "my memories went with the company" every time. Users of MemPalace never face this risk.
The community's right to fork isn't just legal protection -- it's the foundation for building a technical ecosystem. When the core project's direction no longer suits certain users' needs, they can fork their own version. This isn't fragmentation -- this is the normal evolution of open-source software. Linux has countless distributions, each serving different user groups. MemPalace's MIT license grants the same possibility.
The Extreme Constraint of Zero Dependencies
One noteworthy aspect of MemPalace's tech stack is its dependency list:
- Python 3.9+
- chromadb>=0.4.0
- pyyaml>=6.0
No required API key for the raw path. No required cloud service for day-to-day use. After local dependencies and default embedding assets are prepared, routine raw operation no longer needs internet access.
This extreme zero-dependency constraint isn't accidental. Every external dependency is a potential trust point and failure point. Requiring an API key means your data (at least query content) leaves your machine. Requiring cloud services means your system availability depends on someone else's servers. Requiring an internet connection means your memory system is unavailable on planes, in network-restricted environments, or during offline development.
MemPalace chose a more radical stance: after installation, you can unplug the network cable and everything works as before.
ChromaDB is an embedded vector database that runs in-process with data stored on the local filesystem. It doesn't need a separate database server, doesn't need a network connection, doesn't need configuration. SQLite -- the knowledge graph's storage backend -- is the exemplar of embedded databases, not even needing a separate process. AAAK compression completes entirely locally, dependent on no external model or service.
The engineering implications of this constraint are profound. It means MemPalace can't use any feature requiring network calls -- even if those features could significantly boost performance. For example, replacing the local all-MiniLM-L6-v2 with a cloud-based large embedding model (such as OpenAI's text-embedding-3-large) would almost certainly improve retrieval precision. But doing so would introduce dependency on an external service, breaking the zero-dependency constraint.
MemPalace's choice is: achieve 96.6% precision with a local embedding model rather than pursuing higher scores with a cloud model. That 96.6% is the highest score ever achieved with zero API calls. This score isn't the result of compromise -- it's an achievement under strict constraints.
Haiku reranking is an interesting design point. It's the only optional network feature in MemPalace -- using Claude Haiku to rerank local retrieval results can boost precision from 96.6% to 100%. But the keyword is "optional." Without enabling it, the system works perfectly. Enabling it provides icing on the cake, not a lifeline. This design precisely expresses MemPalace's attitude toward network dependency: it can exist, but must not be required.
Three Scenarios
Let's use three concrete scenarios to illustrate what local-first means in practice.
Scenario one: security audit. A fintech company's security team needs to audit all systems that process customer data. For a SaaS memory product, auditing means reviewing the third party's security certifications, data processing agreements, sub-processor lists, and data residency policies. For MemPalace, the audit focus shifts to: read the source code, confirm that day-to-day raw storage / search / wake-up stay local, and separately identify optional networked paths such as benchmark rerank or Wikipedia lookup. An afternoon of code review can often clarify the core data path.
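Part of that code review can even be mechanized. A toy example of the kind of check an auditor might script -- scanning Python source for imports that could open a network path. The module list is an illustrative starting point, not a complete audit:

```python
import ast

# Modules that can open network connections (illustrative, not exhaustive).
NETWORK_MODULES = {"socket", "http", "urllib", "requests", "httpx", "aiohttp"}

def network_imports(source):
    """Return the network-capable modules imported by a Python source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & NETWORK_MODULES

local_code = "import sqlite3\nimport chromadb\n"
cloudy_code = "import requests\nfrom urllib import request\n"
assert network_imports(local_code) == set()
assert network_imports(cloudy_code) == {"requests", "urllib"}
```

A static scan like this cannot prove the absence of egress on its own, but it quickly narrows the review to the files where network-capable code actually lives.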
Scenario two: company shutdown. A team using a certain AI memory SaaS product receives notice: the service will shut down in 60 days. The team needs to export all data within 60 days, find an alternative, migrate, and verify data integrity. This is a high-pressure, time-limited engineering task, and it usually happens at the most inconvenient time. A team using MemPalace never faces this scenario. Data is local, code is on GitHub (or your fork). Nothing needs to be "migrated."
Scenario three: offline environment. A developer on a long flight needs to review a discussion about database sharding strategy from three months ago. Using a cloud memory product, this is impossible -- no network means no memory. Using MemPalace, mempalace search "sharding strategy" returns results instantly, locally. Your memory doesn't depend on whether you're online.
These three scenarios aren't edge cases. Security audits are routine operations in regulated industries. Company shutdowns are a statistical certainty in the startup ecosystem. Offline work is the norm in the mobile work era. Local-first isn't a "nice to have" in these scenarios -- it's a decisive advantage.
Not Against the Cloud, Against Forced Trust
One point needs to be clear: this chapter's argument isn't "cloud services are bad." Cloud services are the right choice in many scenarios -- when you need real-time multi-person collaboration, when you need globally distributed access, when you need operations-free infrastructure.
This chapter's argument is: for the specific data type of AI memory, local-first is the more reasonable default.
The reason traces back to this chapter's opening argument: AI memory data is one of the most private data types. It contains not your identity or financial information -- it contains your thought processes. For such data, "data in your hands" isn't an optional security hardening measure but should be the default architectural posture.
MemPalace implements this posture with a concise tech stack: Python + ChromaDB + SQLite + AAAK. No mandatory servers, no mandatory API path, no subscriptions, and no code you can't audit. Your memory is on your machine, the code processing your memory is in your GitHub fork, and the AAAK dialect compressing your memory is a public specification.
This isn't a technical limitation. It's a design decision. A design decision that grew naturally from the values of the decentralized lending market.
Going Deeper: The Philosophy of Infrastructure
At the deepest level, local-first reflects an answer to the question "who should control infrastructure."
The early internet -- 1990s to early 2000s -- had a natural decentralization tendency. Your email could run on your own server. Your website could be hosted on your own machine. Your data was by default on your hard drive. This wasn't ideologically driven -- it was simply the natural state of technology at the time.
The cloud computing wave of the 2010s changed this default. Infrastructure migrated from local to cloud -- first compute, then storage, then databases, and finally almost everything. This migration had real engineering benefits: elastic scaling, operations-free, global reach. But it also changed a fundamental power relationship: your data was no longer in your hands.
For most types of data -- code (GitHub), documents (Google Docs), communications (Slack) -- this power relationship change is acceptable. The sensitivity of this data is limited, migration costs are manageable, and the convenience cloud services bring is sufficient to offset the surrender of control.
But AI memory is a different data type. Its sensitivity is extremely high (your thought processes), its migration cost is extremely large (a memory system is not just data but also organizational structure and retrieval logic), and its dependency is extremely deep (your AI assistant's effectiveness directly depends on memory availability). For such data, the cost of surrendering control may exceed all the convenience that cloud services bring.
MemPalace's local-first architecture, combined with the MIT open-source license and a minimal-dependency stack, together constitute a strong control guarantee system:
- Data is on your machine (physical control).
- Code is open-source (audit rights).
- The license permits forking and modification (modification and distribution rights).
- No dependency on external services (right to operate isn't constrained by third parties).
These four layers of protection aren't independent -- they reinforce each other, and removing any one breaks the guarantee. Data local but code closed-source: you can't audit. Code open-source but requiring an API key: your right to operate is constrained. Code open-source and data local, but under a license that forbids forking: your long-term survivability isn't guaranteed.
MemPalace satisfies all four conditions simultaneously. This isn't a set of coincidental choices but a complete architecture derived from the principle that "users should have full control over their own memory infrastructure."
Local-first is not a compromise. Local-first is the conclusion.
Chapter 25: Beyond Conversations
Positioning: MemPalace's current validation focuses on conversational memory, but its architecture -- the Wing/Hall/Room/Closet/Drawer hierarchy, the AAAK compression dialect, the temporal knowledge graph -- doesn't depend on "conversation" as a specific data type. This chapter analyzes this architecture's adaptation potential in other domains, as well as the technical roadmap for AAAK entering the Closet layer.
A Structure Bigger Than Conversations
There's an easy-to-overlook sentence in MemPalace's README:
"It has been tested on conversations -- but it can be adapted for different types of datastores."
This isn't a casual aspiration. It's a statement about the architecture's essence.
Recall MemPalace's core structure: Wing is a domain boundary, Room is a concept node, Hall is a classification dimension, Closet is a compressed summary, Drawer is raw content. Among these five layers, none depends by definition on "conversation" as a data form. A Wing doesn't care whether it contains conversation records or code files -- it only cares that "these things belong to the same domain." A Room doesn't care whether it represents a discussion topic or a code module -- it only cares that "this is an independent conceptual unit."
This means MemPalace's spatial structure is data-type agnostic. The palace's retrieval effectiveness comes from the structure itself -- semantic partitioning reduces the search space, hierarchical filtering improves hit precision -- not from the specific format of stored content. The 34% retrieval precision improvement analyzed in Chapter 4 comes from Wing and Room structured filtering, independent of whether the filtered content is conversations or code.
Of course, there's distance between "theoretically possible" and "engineeringly feasible." Let's analyze several specific directions.
Codebase: Wing Is Project, Room Is Module
A mid-sized software team manages five microservices, two frontend applications, and a shared library. Six months later, no one remembers why payment-service's retry logic uses exponential backoff instead of fixed intervals, nor why shared-lib has that seemingly redundant abstraction layer or what specific problem it solved.
Code comments and commit messages should theoretically record this information. In practice, most commit messages are "fix bug" or "refactor auth module," and code comments either don't exist or are outdated. The real design reasoning is scattered across AI conversations, Slack discussions, and closed PR comments.
Adapting MemPalace to a codebase scenario, the mapping is natural:
Wing = project (payment-service, user-frontend, shared-lib)
Room = module or concern (retry-logic, auth-middleware, database-schema)
Hall = knowledge type (hall_facts: design decisions, hall_events: refactoring history,
hall_discoveries: performance findings, hall_advice: best practices)
Closet = module's compressed summary (design intent, key constraints, known limitations)
Drawer = raw content (related conversation records, PR descriptions, design document fragments)
The most valuable part of this mapping is Tunnel -- cross-Wing conceptual connections. When both payment-service and user-frontend have a Room named auth-middleware, Tunnel can tie them together at the graph layer. This means a deeper codebase-memory workflow could surface both backend and frontend perspectives on authentication-related design decisions, even if they were discussed at different times in different contexts.
Among MemPalace's three existing mining modes, the projects mode (mempalace mine <dir>) already supports ingestion of code and documentation files. More precisely, the current implementation gets wing from mempalace.yaml or a --wing override, and routes room through a mix of path hits, filename hints, and content keywords. That is already enough to support "ingest project files into one palace and retrieve them by room," but it is not well described as "maps code files to Wing and Room by directory structure." Building on this, deeper adaptations -- such as generating stronger cross-room structure from import relationships or extracting temporal change information from Git history -- remain engineerable extensions.
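The mix of path hits, filename hints, and content keywords described above can be sketched as a small scoring function. This is a toy illustration under stated assumptions: the hint table and the `route_room` name are hypothetical, not MemPalace's actual code.

```python
from pathlib import Path

# Hypothetical room-hint table; real routing rules would live in config.
ROOM_HINTS = {
    "retry-logic": ["retry", "backoff"],
    "auth-middleware": ["auth", "token", "session"],
    "database-schema": ["schema", "migration", "sql"],
}

def route_room(path: str, content: str) -> str:
    """Score candidate rooms by path segments, filename, and content keywords."""
    parts = [p.lower() for p in Path(path).parts]
    name = Path(path).name.lower()
    text = content.lower()
    best_room, best_score = "misc", 0
    for room, keywords in ROOM_HINTS.items():
        score = 0
        if room in parts:  # direct path hit weighs most
            score += 3
        score += sum(1 for kw in keywords if kw in name)  # filename hint
        score += sum(1 for kw in keywords if kw in text)  # content keyword
        if score > best_score:
            best_room, best_score = room, score
    return best_room
```

A file under `payment-service/retry-logic/` whose content mentions backoff would route to the `retry-logic` room; files with no hits fall through to a default.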
Document Library: Wing Is Knowledge Domain, Room Is Topic
The core problem facing enterprise document management isn't storage -- storage is never the problem. The problem is retrieval. When an organization has thousands of pages of product documentation, technical specifications, meeting minutes, and research reports, "find that document about GDPR-compliant data retention policies" becomes a non-trivial retrieval task.
Existing document management systems -- Confluence, Notion, SharePoint -- use folder hierarchies and tags to organize documents. The limitations of these organizational approaches were already analyzed in Chapter 4: they're categories from the administrator's perspective, not navigation structures from the searcher's perspective.
MemPalace's palace structure offers a different organizational approach:
Wing = knowledge domain (compliance, product-design, engineering-standards)
Room = specific topic (gdpr-data-retention, oauth-implementation, api-versioning)
Hall = document type (hall_facts: specifications and standards, hall_events: meeting resolutions,
hall_advice: implementation guidelines)
The key advantage of this structure is: when searching, you don't need to know the document's title or tags -- you only need to describe the information you're looking for, and the system can use Wing / Room structure to narrow the search space. If document-memory adaptation is pushed further, a natural-language query like "what special requirements does our data retention policy have for EU users" could be organized toward something like wing_compliance / gdpr-data-retention before semantic retrieval. What needs to stay separate from the current implementation is that the public searcher.py supports only explicit wing / room filtering; it does not yet auto-classify natural-language queries into hall_facts or a specific room at runtime.
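The filter-then-rank pattern behind that narrowing can be shown in miniature. The record shape and the term-overlap scorer below are illustrative stand-ins (a real deployment would score with embeddings), not searcher.py's implementation.

```python
# Structure-first retrieval sketch: apply explicit wing/room filters to
# shrink the candidate set, then rank what remains semantically.
def search(records, query_terms, wing=None, room=None, top_k=3):
    candidates = [
        r for r in records
        if (wing is None or r["wing"] == wing)
        and (room is None or r["room"] == room)
    ]
    # Stand-in for embedding similarity: count query-term overlap.
    def score(r):
        return sum(1 for t in query_terms if t in r["text"].lower())
    return sorted(candidates, key=score, reverse=True)[:top_k]
```

The point is the ordering: structural filters run first, so semantic scoring only ever sees records from the right domain and topic.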
Email and Communications: Wing Is Contact, Room Is Project
Another natural adaptation direction is email and communication records. MemPalace already supports ingestion of Slack exports. Extending this capability to email, the mapping is clear:
Wing = contact or team (wing_client_acme, wing_vendor_stripe, wing_team_infra)
Room = project or topic (contract-renewal, api-integration, incident-2026-03)
Tunnel is especially valuable in this scenario. When client Acme's contract renewal discussion (wing_client_acme / contract-renewal) and the internal infrastructure team's capacity planning discussion (wing_team_infra / capacity-planning) touch on the same topic -- such as "how many additional compute resources are needed for next year's SLA commitments" -- Tunnel automatically establishes the connection. When reviewing client negotiation history, you can automatically discover the internal team's related discussions, and vice versa.
Note Systems: Wing Is Domain, Room Is Concept
The core philosophy of personal knowledge management tools -- Obsidian, Logseq, Roam Research -- is bidirectional linking: connections between notes are as important as the notes themselves. MemPalace's Tunnel mechanism is essentially bidirectional linking -- when the same Room name appears in different Wings, connections are automatically created.
Wing = knowledge domain (distributed-systems, machine-learning, product-management)
Room = concept (consensus-algorithms, gradient-descent, user-retention)
An interesting possibility is: MemPalace's palace structure could serve as a retrieval acceleration layer for existing note tools. You continue writing notes in Obsidian, but MemPalace ingests note content into the palace structure in the background, providing cross-note semantic retrieval and automatic association discovery. Note tools excel at creation and browsing; MemPalace excels at retrieval and association. The combination of both may be more powerful than using either one alone.
AAAK Entering the Closet Layer
All the above extension directions can be implemented on MemPalace's current architecture -- they essentially change the ingestion pipeline and mapping rules, with no changes needed to the core storage and retrieval mechanisms. But there's a deeper technical evolution direction that will significantly change the system's performance characteristics: the AAAK dialect entering the Closet layer.
To understand the significance of this evolution, you first need to separate the conceptual design of the Closet layer from the current runtime.
More accurately, the current public repository does not yet expose an explicit Closet storage layer. mempalace mine currently writes chunked raw text directly into mempalace_drawers; the main metadata fields are wing, room, source_file, and similar tags. searcher.py's default path also queries that drawers collection directly and returns verbatim text. In other words, Drawer is the runtime layer that is clearly implemented today, while Closet still lives mostly in the README and the benchmark narrative as a navigational middle layer.
So when the README says "add AAAK directly to the closets," it is not describing a fully implemented subsystem waiting only for a new encoding. It is describing the next step: turn that conceptual / experimental middle layer into an explicit compressed navigation layer.
MemPalace's README explicitly mentions this evolution direction:
"In our next update, we'll add AAAK directly to the closets, which will be a real game changer -- the amount of info in the closets will be much bigger, but it will take up far less space and far less reading time for your agent."
Let's analyze the feasibility of this direction based on dialect.py's current capabilities.
The Dialect class's compress() method accepts plain text input and outputs AAAK format. It does several things:
First, entity detection and encoding. _detect_entities_in_text() scans text for known entities (via preconfigured entity mappings) and suspected entities (via capitalized-word heuristics), encoding "Kai" in "Kai recommended Clerk" as "KAI."
Second, topic extraction. _extract_topics() extracts key topic words through word frequency analysis and heuristic weighting (capitalized words, terms containing hyphens/underscores get bonus points), compressing lengthy descriptions into topic tags like auth_migration_clerk.
Third, key sentence extraction. _extract_key_sentence() scores each sentence -- sentences containing decision words ("decided," "because," "instead") score higher, shorter sentences are preferred -- extracting the most information-dense fragments.
Fourth, emotion and flag detection. _detect_emotions() and _detect_flags() detect the text's emotional tendency and importance markers (DECISION, ORIGIN, TECHNICAL, etc.) through keyword matching.
A 500-word conversation summary, after processing by compress(), might be compressed into two or three lines of AAAK format:
wing_kai|auth-migration|2026-01|session_042
0:KAI+PRI|auth_migration_clerk|"Chose Clerk over Auth0 pricing+dx"|determ+convict|DECISION+TECHNICAL
Approximately 30 tokens. The original summary might be 300 tokens. A compression ratio of roughly 10x.
When this compression is applied to the Closet layer, the effect is twofold.
Effect one: the same storage space can hold more information. If a Closet could previously store 10 summaries (3000 tokens), after AAAK conversion it can store 100 (same 3000 tokens). This means the AI gains ten times the contextual coverage when reading a single Closet.
Effect two: the AI reads faster. AAAK is designed as a format instantly comprehensible to AI -- the system teaches the AI AAAK syntax in the mempalace_status response, and the AI directly parses AAAK in subsequent interactions. Reading a 30-token AAAK summary is much faster than reading a 300-token English summary, while the information content is equivalent. In scenarios requiring scanning many Closets to locate information, this speed difference is decisive.
From dialect.py's current implementation, this evolution is technically feasible. The compress() method can already handle arbitrary plain text input, independent of any specific data structure. Integrating it into the ingestion pipeline -- calling dialect.compress() for AAAK encoding after generating Closet summaries -- is an incremental engineering change that doesn't require restructuring the core architecture.
One technical consideration to note: AAAK-compressed text may behave differently in the semantic embedding space than original English. The embedding model used by ChromaDB (such as all-MiniLM-L6-v2) was trained on English text, and AAAK-formatted text -- like KAI+PRI|auth_migration_clerk -- may produce embedding vectors different from the English equivalent description. This means that after Closet layer AAAK conversion, semantic matching between search queries (typically English natural language) and Closet content (AAAK format) may need adjustment.
One possible solution is dual storage: Closet simultaneously retains the AAAK version (for AI reading) and the original English version (for embedding retrieval). This adds some storage overhead but maintains retrieval precision. Another approach is to also convert queries to AAAK format during search, so that queries and content match in the same representation space -- but this requires validating the embedding model's behavior on AAAK text.
Regardless of which approach is adopted, the direction of AAAK entering the Closet layer is clear, and feasibility is well-founded. It's not a feature that needs to be reinvented but an application of existing AAAK encoding capability to the existing Closet architecture.
The Open Source Community's Exploration Space
MemPalace is open-sourced under the MIT license, meaning all the above extension directions don't need to wait for the official team to implement. Any interested developer in the community can fork the project and implement their own ingestion pipeline adaptations.
Several specific exploration spaces worth highlighting:
Diversification of ingestion pipelines. The current convo_miner.py handles normalization of five conversation formats. The same pipeline pattern can be extended to more data types: ingestion of Git commits and PR comments, ingestion of Obsidian vaults, ingestion of browser bookmarks and highlights. Each data type needs a normalizer (converting raw format to standard structure); the rest of the palace logic can be reused.
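A normalizer for a new data type is a small surface. The sketch below shows what a Git-commit normalizer might look like; the field names and commit-dict shape are assumptions for illustration, not convo_miner.py's actual contract.

```python
def normalize_git_commit(commit: dict) -> dict:
    """Map a raw commit dict to a common palace ingestion record."""
    return {
        "wing": commit["repo"],                   # project -> Wing
        "room": commit.get("module", "misc"),     # touched module -> Room
        "timestamp": commit["date"],
        "text": f"{commit['message']}\n\n{commit.get('body', '')}".strip(),
        "source": f"git:{commit['sha'][:8]}",
    }
```

Once a raw format is reduced to this shape, the rest of the palace logic (chunking, embedding, wing/room storage) is reused unchanged.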
Automatic discovery of Wing/Room. The current mempalace init is closer to "local scan + interactive confirmation": it detects candidate rooms from directories, filenames, and content, then lets the user accept, edit, or extend them. For large datasets, stronger automatic discovery may be more practical -- using clustering or other structure-learning methods to identify domain boundaries (Wings) and concept nodes (Rooms). This is especially valuable in document libraries and email archives.
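To make the clustering idea tangible, here is a deliberately toy Room-discovery sketch: greedy Jaccard clustering over token sets. A real pipeline would cluster embeddings (k-means, HDBSCAN, or similar); nothing here is part of mempalace.

```python
def discover_rooms(docs, threshold=0.3):
    """Greedily group docs whose token overlap exceeds a Jaccard threshold."""
    clusters = []  # each: {"tokens": set, "members": [doc, ...]}
    for doc in docs:
        toks = set(doc.lower().split())
        for c in clusters:
            overlap = len(toks & c["tokens"]) / len(toks | c["tokens"])
            if overlap >= threshold:
                c["tokens"] |= toks
                c["members"].append(doc)
                break
        else:
            clusters.append({"tokens": set(toks), "members": [doc]})
    return clusters
```

Each resulting cluster is a candidate Room; grouping clusters by a coarser signal (directory, sender, document collection) would yield candidate Wings.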
Cross-source fusion of the knowledge graph. When different types of data are ingested into the same palace, the knowledge graph (knowledge_graph.py) provides a local triple store capable of holding cross-source entity relationships. A client name mentioned in email, the same name appearing in code comments, and the same client discussed in meeting minutes could, in principle, all be expressed as temporal triples inside the same graph. What must be clarified is that the public repository clearly implements the add_triple / query_entity layer, not a fully automatic cross-source extraction-and-fusion pipeline.
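The add_triple / query_entity layer can be approximated in a few lines. The signatures below are assumptions for illustration, not knowledge_graph.py's actual API.

```python
class TripleStore:
    """Minimal temporal triple store in the spirit described above."""

    def __init__(self):
        self.triples = []

    def add_triple(self, subject, predicate, obj, timestamp):
        self.triples.append((subject, predicate, obj, timestamp))

    def query_entity(self, entity):
        """Return all triples mentioning the entity, newest first."""
        hits = [t for t in self.triples if entity in (t[0], t[2])]
        return sorted(hits, key=lambda t: t[3], reverse=True)
```

The cross-source fusion scenario then reduces to: as long as the email miner, the code miner, and the minutes miner all emit triples about the same entity string, querying that entity surfaces all three sources in one timeline.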
Domain extension of Specialist Agents. The README has used reviewer, architect, and ops as software-development specialist examples, but the current public repository actually implements a lower-level generic mechanism: any agent_name can own its own wing_<agent>/diary. Precisely because that base is simple, the same storage structure could be extended to other domains: a sales agent tracking client-relationship evolution, a research agent tracking papers and research directions, or a legal agent tracking changes in compliance requirements.
No Roadmap Commitments
This chapter deliberately uses wording like "could," "can," and "direction" rather than "will," "plans to," or "expected to." The reason is simple: MemPalace is an actively developing open-source project, and its future direction depends on community needs, contributor interests, and actual engineering validation. Drawing a polished product roadmap is easy; delivering on it is hard.
The more honest approach is to say: MemPalace's architecture -- Wing/Hall/Room spatial structure, AAAK compression capability, temporal knowledge graph -- is general by design. The domain they've been validated in is conversational memory, with validation results of 96.6% (zero API) and 100% (Haiku reranking). Whether they can achieve the same effectiveness in codebases, documents, email, and notes requires actual engineering attempts and benchmark validation.
This is also where open source's value lies. When a closed-source product says "we will support codebase memory," you can only wait. When an open-source project says "the architecture supports codebase memory," you can verify it yourself. Fork the code, write a code ingestion pipeline, run a benchmark -- the entire validation process is open to anyone.
The Palace's Boundaries
MemPalace's core insight is: structure matters more than algorithms. On the retrieval problem, a good spatial organization structure delivers precision improvements (34%) that exceed what most pure algorithm optimizations can achieve.
This insight isn't limited to conversations. It applies to any scenario requiring fast localization of specific knowledge within large volumes of information. Design decision retrieval in codebases, policy lookup in document libraries, historical discussion tracing in email, concept association discovery in notes -- all these scenarios face the same core problem: the search space is too large, and pure semantic matching lacks sufficient discrimination.
MemPalace's palace structure -- through the introduction of domain boundaries (Wing), classification dimensions (Hall), and concept nodes (Room) -- provides a solution to this problem that is agnostic to data type. It doesn't rely on larger models, better embeddings, or more computation -- it relies on better organization.
This is a simple but profound engineering judgment: rather than having AI search through 22,000 unstructured records, first build a palace for those records so the AI knows which room to look in.
Conversations are just the first domain where MemPalace validates this judgment. They won't be the last.
Chapter 26: Why Rewrite in Rust
Positioning: This chapter traces the path from analyzing MemPalace to deciding to reforge it in Rust. Prerequisite: Chapters 1-25 (the analysis that revealed the structural gaps). Applicable scenario: when you have thoroughly analyzed an existing system and must decide whether to patch or rebuild.
The Analysis That Became a Blueprint
This book began as a third-party design analysis. We set out to understand MemPalace's architecture — its spatial metaphor, its compression language, its local-first philosophy — and to assess how well the implementation delivered on those ideas. Twenty-five chapters and four appendices later, the analysis was complete. But something unexpected happened along the way: the analysis itself became a blueprint for a new implementation.
The twenty-five chapters that precede this one are not preamble to a Rust project. They are the reason the Rust project exists. Every design decision in mempal — our Rust reimplementation — traces back to a specific finding in this book. This chapter documents those connections.
The story is not "MemPalace was bad, so we rewrote it." MemPalace's design ideas are sound — verbatim storage, spatial retrieval, AAAK compression, local-first architecture. The story is: "MemPalace's ideas are good enough to deserve a more rigorous implementation."
Trigger Points: What the Analysis Revealed
Three findings from the book's analysis converged into a single conclusion: the gap between MemPalace's design intent and its implementation was structural, not incidental.
graph LR
subgraph "Book Analysis (Chapters 1-25)"
A1["Ch 9 + App C:<br/>AAAK has no<br/>formal grammar"]
A2["Ch 5 + Ch 7 + App D:<br/>3 of 5 tiers<br/>underutilized"]
A3["Ch 19 + App D:<br/>19 tools,<br/>8 depend on<br/>unfinished subsystems"]
end
subgraph "mempal Response"
R1["mempal-aaak:<br/>BNF + encoder +<br/>decoder + round-trip"]
R2["mempal-core:<br/>Wing/Room only +<br/>editable taxonomy"]
R3["mempal-mcp:<br/>5 tools +<br/>self-describing protocol"]
end
A1 --> R1
A2 --> R2
A3 --> R3
AAAK: A Language Without a Grammar
Chapter 9 walked through AAAK's six grammar elements — three-letter entity codes, pipe delimiters, star ratings, emotion tags, semantic flags, and tunnels. The syntax is elegant. But Appendix C, which attempted to provide a complete dialect reference, revealed a critical gap: there is no formal grammar.
The dialect.py encoder is a heuristic pipeline. It selects key sentences, extracts top-frequency topics, truncates entity and emotion lists, and concatenates them with pipe delimiters. There is no BNF specification defining what constitutes a valid AAAK document. There is no decoder — once text is compressed, the only way to "decompress" it is to have an LLM read it. There is no round-trip test verifying that encoding and decoding preserve all factual assertions.
As Appendix D put it: if AAAK is read as a design direction, it is credible. If read as a fully delivered product capability, it is not. The encoder produces output that looks like AAAK but isn't mechanically verifiable as AAAK, because "valid AAAK" has no formal definition.
This is not a bug to fix with a patch. A formal grammar, a conforming encoder, a decoder, and round-trip verification tests constitute a redesign of the AAAK subsystem from its specification layer up. In mempal, this became the mempal-aaak crate (crates/mempal-aaak/), which implements a BNF-defined grammar, a structured encoder, a decoder, and round-trip property tests — the complete stack that dialect.py was missing.
Five Tiers, Three Unused
Chapter 5 documented the five-tier spatial hierarchy: Wing → Hall → Room → Closet → Drawer. Chapter 7 demonstrated the retrieval improvement: from 60.9% baseline to 94.8% with Wing, Hall, and Room filtering — a gain of 33.9 percentage points. But our analysis revealed that the improvement is not evenly distributed across tiers.
The bulk of the retrieval gain comes from Wing filtering alone (+12.2 percentage points). Adding Hall brings a further +11.7 points. Room adds another +10 points. The deeper tiers — Closet and Drawer — were not benchmarked separately, and contribute no measured retrieval benefit beyond Room.
More importantly, Appendix D found that "Hall / Closet / agent architecture is narrated more completely than implemented." The current searcher.py supports explicit Wing and Room filtering. Hall exists as a taxonomy concept but is not a default routing target. Closet exists as a storage-layer idea but the runtime primarily operates on Drawers directly.
This means three of the five tiers in the spatial hierarchy are, in practice, either underutilized or aspirational. A redesign could simplify to two tiers — Wing and Room — while preserving the vast majority of the retrieval benefit, and replace the static hierarchy with an editable taxonomy that adapts to actual usage patterns. In mempal, this simplification lives in mempal-core (crates/mempal-core/src/db.rs), where the drawers table has wing and room columns with an editable taxonomy table driving query routing — no Hall, no Closet, no static tier hierarchy.
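The shape of that two-tier schema can be sketched with SQLite directly. The column and index names follow the text; the exact DDL in crates/mempal-core/src/db.rs may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drawers (
    id      INTEGER PRIMARY KEY,
    wing    TEXT NOT NULL,
    room    TEXT NOT NULL,
    content TEXT NOT NULL,   -- verbatim text, never a summary
    created TEXT NOT NULL
);
CREATE INDEX idx_drawers_wing      ON drawers (wing);
CREATE INDEX idx_drawers_wing_room ON drawers (wing, room);

-- Editable taxonomy: wings and rooms are rows, not code.
CREATE TABLE taxonomy (
    wing        TEXT NOT NULL,
    room        TEXT NOT NULL,
    description TEXT,
    PRIMARY KEY (wing, room)
);
""")
```

With this layout, a wing/room-scoped query is an indexed equality lookup, and reshaping the hierarchy means updating rows in `taxonomy` rather than migrating a five-tier schema.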
19 Tools, 5 Cognitive Roles
Chapter 19 analyzed the MCP server's 19 tools organized into 5 cognitive roles: Read (7), Write (2), Knowledge Graph (5), Navigation (3), and Diary (2). The role-based organization is intellectually coherent. But for an AI agent making tool-selection decisions, 19 choices create a large decision space.
Each tool call costs tokens — not just for the call itself, but for the LLM to evaluate which of 19 tools best matches its current intent. The Knowledge Graph group (5 tools) and Navigation group (3 tools) depend on subsystems that Appendix D flagged as more narrated than exercised. The Diary group assumes a specialist agent architecture that is not yet part of the default runtime.
The question became: could a smaller tool surface — focused on what is actually production-ready — serve agents better? Not because fewer tools are inherently better, but because each tool in a smaller set can carry richer self-documentation, and agents spend fewer tokens on tool selection. mempal's answer is 5 tools — mempal_status, mempal_search, mempal_ingest, mempal_delete, and mempal_taxonomy — registered in crates/mempal-mcp/src/server.rs. Each tool's input schema carries doc comments that teach agents how to use it correctly (see Chapter 28).
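A compressed view of what a five-tool surface looks like as data: the five tool names come from the text, while the one-line descriptions and required-field lists below are illustrative, not the schemas in crates/mempal-mcp/src/server.rs.

```python
# Hypothetical tool registry sketch: a small surface where each entry
# carries its own self-documentation for the agent.
TOOLS = {
    "mempal_status":   {"description": "Palace overview: wings, rooms, drawer counts. Call first."},
    "mempal_search":   {"description": "Semantic search; optional wing/room filters.",
                        "required": ["query"]},
    "mempal_ingest":   {"description": "Store verbatim text under a wing/room.",
                        "required": ["wing", "room", "content"]},
    "mempal_delete":   {"description": "Delete drawers matching an explicit id list.",
                        "required": ["ids"]},
    "mempal_taxonomy": {"description": "List or edit the wing/room taxonomy."},
}
```

With five entries instead of nineteen, each description can afford to be instructive ("Call first") rather than merely nominal, and the agent's tool-selection step reads one short table.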
The Judgment: Why Patching Was Not Enough
With these three findings in hand, we faced a choice: contribute patches to MemPalace, or build a new implementation from the same design principles.
We chose to rebuild. Here is why.
Coupling Makes Targeted Fixes Expensive
The AAAK subsystem touches multiple layers. dialect.py is invoked by cli.py for compression, while mcp_server.py hardcodes an AAAK_SPEC string for status responses. The encoder's output format is referenced by palace_graph.py's storage paths and searcher.py's retrieval logic. Adding a formal grammar requires changing not just the encoder but the interfaces between compression, storage, and retrieval. In a 21-module Python codebase where modules share state through ChromaDB collections and in-memory caches, changing one subsystem's contract ripples through others.
Similarly, simplifying the five-tier hierarchy to two tiers is not a matter of deleting three layers. The tier structure is embedded in storage schemas, query routing logic, and MCP tool semantics. A targeted "remove Hall and Closet" patch would need to touch palace_graph.py, searcher.py, mcp_server.py, and layers.py — essentially rewriting the data model while preserving backwards compatibility with existing palaces.
The Design Document Already Existed
By the time we finished the book's analysis, we had something unusual: a detailed, evidence-based specification for what the implementation should look like. The book's 25 chapters identified what works (verbatim storage, Wing/Room filtering, MCP interface), what doesn't (heuristic AAAK, unused tiers, oversized tool surface), and what the design intent was (local-first, cross-model, single-file storage).
That specification was more complete than what most rewrite projects start with. We were not guessing at requirements — we had derived them from 25 chapters of analysis. The risk profile of "rebuild from a detailed spec" was lower than "patch an existing system while preserving backwards compatibility."
Different Product Form, Different Language
MemPalace is a Python library. It requires pip install, a Python runtime, and ChromaDB as a vector store. For a developer's personal tool, this is acceptable. For a tool that every coding agent should be able to use — Claude Code, Codex, Cursor, Gemini CLI — the installation friction matters.
The product vision for the reimplementation was different: a single binary, zero external dependencies, ready in seconds. This product form is not achievable as a patch to a Python codebase — it requires starting from a different foundation entirely.
The Single-Binary Philosophy
The product form drove many technical decisions. A single binary means:
Zero installation friction. One command, one binary, no runtime, no package manager conflicts, no virtual environment. An AI agent's MCP configuration points to the binary path and it works.
Single-file database. SQLite replaces ChromaDB. The entire memory palace is one file: ~/.mempal/palace.db. Backup is cp. Migration is scp. There is no server process to manage, no port to configure, no data directory to locate.
Embedded inference. The ONNX Runtime is compiled into the binary. The default embedding model (MiniLM-L6-v2, 384 dimensions) downloads once and runs locally. No API keys, no network dependency after first run, no per-query cost.
Self-contained MCP server. mempal serve --mcp starts a stdio-based MCP server from the same binary. No separate server process, no port allocation, no process management.
This is not minimalism for its own sake. It is a design response to a concrete problem: coding agents run in diverse environments — local terminals, CI pipelines, remote SSH sessions, containerized dev environments. A tool that requires pip install plus a ChromaDB instance plus a Python runtime is viable in some of these environments. A single binary is viable in all of them.
Why Rust, Specifically
Given the single-binary requirement, several languages could work: Go, Zig, C++, or Rust. We chose Rust for reasons specific to mempal's use case, not for general language advocacy.
MCP servers are long-running processes. When an AI agent connects to mempal via MCP, the server process stays alive for the duration of the session — potentially hours. A long-running process that holds open SQLite connections, manages embedding model state, and serves concurrent tool calls benefits from deterministic resource cleanup. Rust's ownership model ensures database connections are released exactly when their scope ends and memory is freed without relying on a garbage collector's timing — predictability that matters for a process that may run unattended for an entire coding session.
The type system enforces interface contracts. mempal has 8 crates with well-defined boundaries: mempal-core defines data types, mempal-embed defines the Embedder trait, mempal-search consumes both. Rust's type system ensures that when we change a type in mempal-core, every downstream consumer is checked at compile time. In a rewrite that redesigns subsystem interfaces — exactly what we needed after the book's analysis — this is not a luxury but a necessity.
crates.io distribution. cargo install mempal is the entire installation story. The Rust ecosystem's package registry and build system align perfectly with the single-binary product form. No PyPI wheel compatibility issues, no platform-specific build scripts, no runtime version conflicts.
SQLite and ONNX have mature Rust bindings. rusqlite (with bundled SQLite) and ort (ONNX Runtime) are production-grade crates. sqlite-vec provides vector search as a SQLite extension. The specific technical stack mempal needs — embedded database plus vector search plus ML inference — has strong Rust ecosystem support.
What Rust Does Not Solve
Language choice does not solve design problems. Rust did not tell us to simplify five tiers to two, or to add a formal grammar to AAAK, or to reduce 19 MCP tools to 5. Those decisions came from the book's analysis. Rust provided a vehicle for implementing those decisions in a product form that serves coding agents well.
Rust also did not eliminate all challenges. The embedding model (MiniLM) is English-centric, which degrades search quality for non-English queries — a problem we discovered during dogfooding and addressed with a protocol-level workaround rather than a language-level solution (see Chapter 28). The type system catches interface mismatches but cannot verify that a search result is semantically relevant. Static analysis prevents memory corruption but not data loss from overly broad deletion queries — a lesson we learned the hard way during development (see Chapter 29).
The decision to use Rust was pragmatic, not ideological. A different project with different deployment constraints might reasonably choose Go, Zig, or even stay with Python. What mattered was not the language but the clarity of the specification — and that specification was the product of twenty-five chapters of analysis.
From Analysis to Practice
This chapter marks a transition in the book's narrative. For twenty-five chapters, we analyzed someone else's design. We identified what works, what needs improvement, and what the implementation gaps are. Starting here, we turn that analysis into practice.
The remaining chapters of Part 10 trace the specific design decisions that followed:
- Chapter 27 details what we preserved from MemPalace and what we changed, with evidence for each decision.
- Chapter 28 examines the self-describing protocol — how mempal teaches AI agents to use it correctly, and why each protocol rule exists because of a real failure.
- Chapter 29 documents multi-agent coordination — the unexpected discovery that a memory tool becomes a coordination mechanism between different AI agents.
The loop from analysis to implementation is not closed yet. Benchmarks have not been run against MemPalace's published numbers. The temporal knowledge graph remains a schema reservation rather than a working feature. These honest gaps will be addressed in a future chapter. For now, the chapters ahead show what happens when twenty-five chapters of critique meet a Rust compiler.
Chapter 27: What Stayed, What Changed
Positioning: This chapter compares MemPalace (Python) and mempal (Rust) dimension by dimension — what design ideas survived the rewrite, what implementation structures changed, and why. Prerequisite: Chapter 26 (why the rewrite happened). Applicable scenario: when evaluating which parts of an existing design deserve preservation versus redesign.
Five Ideas Worth Preserving
Before cataloging changes, we should be clear about what did not change. mempal preserves five core design ideas from MemPalace — each one validated by this book's analysis.
1. Verbatim storage (Chapter 3). Raw conversation text is stored exactly as ingested. Drawers in mempal's drawers table (crates/mempal-core/src/db.rs:16-25) hold the original content, never a summary or extraction. Chapter 3's economic argument holds: 19.5 million tokens of conversation data over six months is roughly 100 MB of raw text — storage cost is negligible. The real problem is retrieval, not storage. mempal does not compress on ingest.
2. Spatial structure for retrieval (Chapters 5 and 7). The idea that semantic partitioning improves retrieval precision — organizing memories into named regions rather than dumping everything into a flat vector space — is preserved entirely. Chapter 7's data showed a 33.9 percentage-point improvement from baseline to full spatial filtering. mempal implements this through Wing and Room columns on every drawer, with indexes that make filtered queries fast (idx_drawers_wing, idx_drawers_wing_room in db.rs:51-52).
3. AAAK as output formatter, not storage encoder (Chapter 8). MemPalace's design doc describes AAAK as a compression layer. mempal follows the same principle but makes the boundary explicit: AAAK is never applied during ingestion or storage. It exists only on the output path — when a CLI wake-up command or an MCP status response wants to compress recent context for a token-constrained AI. The mempal-aaak crate has no dependency on mempal-ingest or mempal-search, enforced by the Cargo workspace dependency graph.
4. MCP as the primary AI interface (Chapter 19). MemPalace exposed its memory palace through MCP tools. mempal does the same: mempal serve --mcp starts a stdio-based MCP server. The conviction that AI agents need a structured tool interface — not just a REST API or a command-line wrapper — carries over directly.
5. Local-first architecture (Chapter 24). No cloud services, no API keys for core operation, no data leaving the machine. mempal's entire state is a single SQLite file. The embedding model runs locally via ONNX Runtime. This is not a compromise but a design choice — Chapter 24's argument about "cognitive X-rays" applies equally to mempal.
These five ideas are the load-bearing walls of the design. Everything else — the number of tiers, the storage engine, the compression implementation, the tool surface — is the interior that can be restructured.
Dimension 1: Spatial Structure — Five Tiers to Two
This is the most visible architectural change.
What MemPalace Does
MemPalace defines five tiers: Wing → Hall → Room → Closet → Drawer. Chapter 5 analyzed each tier's purpose. Wing scopes by person or project. Hall classifies by memory type (facts, events, discoveries, preferences, advice). Room identifies a specific concept. Closet holds a compressed AAAK summary. Drawer holds the original text.
graph TD
subgraph "MemPalace: Five Tiers"
W["Wing<br/>(person/project)"]
H["Hall<br/>(memory type)"]
R["Room<br/>(concept)"]
C["Closet<br/>(AAAK summary)"]
D["Drawer<br/>(original text)"]
W --> H --> R --> C --> D
end
subgraph "mempal: Two Tiers + Taxonomy"
MW["Wing<br/>(project)"]
MR["Room<br/>(auto-routed)"]
MD["Drawer<br/>(original text)"]
MT["Taxonomy<br/>(editable keywords)"]
MW --> MR --> MD
MT -.->|"routes queries"| MR
end
What mempal Does
mempal uses two tiers: Wing and Room. The drawers table has wing TEXT NOT NULL and room TEXT columns. A separate taxonomy table (db.rs:43-49) maps (wing, room) pairs to display names and keyword lists. When a query arrives, mempal-search's routing module (crates/mempal-search/src/route.rs) matches query terms against taxonomy keywords to determine which Wing and Room to filter by.
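The routing idea is small enough to sketch. The following is an illustrative reconstruction, not route.rs itself; the scoring rule (most keyword hits wins, no hits means no filter) is an assumption made for the example:

```rust
use std::collections::HashMap;

/// Hypothetical keyword-based router: score each (wing, room) scope by how
/// many query terms appear in its taxonomy keyword list, then pick the best.
fn route_query<'a>(
    query: &str,
    taxonomy: &'a HashMap<(String, String), Vec<String>>,
) -> Option<&'a (String, String)> {
    let terms: Vec<String> = query
        .to_lowercase()
        .split_whitespace()
        .map(str::to_owned)
        .collect();
    taxonomy
        .iter()
        .map(|(scope, keywords)| {
            let hits = terms
                .iter()
                .filter(|t| keywords.iter().any(|k| k == *t))
                .count();
            (scope, hits)
        })
        .filter(|(_, hits)| *hits > 0) // no keyword hits: fall back to global search
        .max_by_key(|(_, hits)| *hits)
        .map(|(scope, _)| scope)
}

fn main() {
    let mut tax = HashMap::new();
    tax.insert(
        ("mempal".to_string(), "auth".to_string()),
        vec!["auth".to_string(), "clerk".to_string(), "login".to_string()],
    );
    tax.insert(
        ("mempal".to_string(), "storage".to_string()),
        vec!["sqlite".to_string(), "migration".to_string()],
    );
    // "clerk" and "auth" both hit the auth scope; "migration" hits storage once.
    let scope = route_query("why did the clerk auth migration fail", &tax);
    assert_eq!(scope.map(|s| s.1.as_str()), Some("auth"));
}
```

In the real system the routed (wing, room) pair becomes a SQL filter on the drawers table; here it is simply returned.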
Why the Change
Chapter 7's retrieval data tells the story. Wing filtering alone provides +12.2 percentage points. Adding Room filtering (Wing+Room combined) brings the total to +33.9 points. But Appendix D found that Hall is "narrated more completely than implemented" — searcher.py supports Wing and Room filtering explicitly, not Hall as a first-class routing target.
The practical reality is that most of the retrieval benefit comes from two operations: narrowing by project/domain (Wing) and narrowing by concept (Room). Hall's classification by memory type (facts vs. events vs. preferences) is theoretically useful but was not part of the default retrieval path in MemPalace's actual code.
Closet — the compressed AAAK summary tier — has no persistent counterpart in mempal; its function is absorbed by output-side AAAK formatting. Rather than storing compressed summaries as a standing layer, mempal generates them on demand when requested. This eliminates the need to keep summaries synchronized with their source drawers.
The editable taxonomy is the key design replacement. Instead of a static five-tier hierarchy that must be defined upfront, mempal's taxonomy can be modified at runtime. mempal taxonomy edit <wing> <room> --keywords "auth,migration,clerk" updates routing keywords. MCP-connected agents can do the same via the mempal_taxonomy tool. The taxonomy adapts to actual usage patterns rather than requiring users to commit to a classification scheme before they know what they'll store.
This is not a claim that two tiers are always better than five. It is a claim that, given the evidence from Chapter 7 and Appendix D, two tiers plus an editable taxonomy capture most of the retrieval benefit while eliminating the implementation complexity of three tiers that were aspirational in MemPalace's codebase.
Dimension 2: Storage — ChromaDB to SQLite + sqlite-vec
What MemPalace Does
MemPalace uses ChromaDB as its vector store. palace_graph.py writes drawers to ChromaDB collections, and searcher.py queries them for semantic retrieval. ChromaDB provides embedding storage, similarity search, and metadata filtering in one package.
What mempal Does
mempal uses SQLite with the sqlite-vec extension. The drawers table stores content and metadata. The drawer_vectors virtual table (db.rs:27-30) stores 384-dimensional float vectors using vec0. Search queries use embedding MATCH vec_f32(?) for k-NN retrieval, joined with the drawers table for metadata filtering.
The entire database is one file: ~/.mempal/palace.db.
Why the Change
Three engineering requirements drove the switch:
Transactions and schema migration. SQLite provides ACID transactions and PRAGMA user_version for schema versioning. mempal's apply_migrations() function (db.rs) applies forward migrations automatically when opening a database. When we added deleted_at for soft-delete support, this was a one-line ALTER TABLE in a versioned migration. ChromaDB has no equivalent mechanism — schema changes require recreating collections.
Single-file portability. A SQLite database is one file. Backup is cp palace.db palace.db.bak. Transfer between machines is scp. There is no server process, no port, no data directory with multiple files. For a personal developer tool that might live in a dotfiles repository or be synced across machines, single-file storage eliminates an entire class of deployment problems.
Embedded deployment. SQLite is compiled into the binary via rusqlite's bundled feature. sqlite-vec is likewise bundled. There is no external process to start, no version compatibility to manage, no network connection to a vector database. The binary is self-sufficient.
What we gave up: ChromaDB's embedding management (mempal handles this separately via the mempal-embed crate and the Embedder trait), and ChromaDB's built-in collection-level isolation (mempal uses Wing/Room columns with SQL indexes instead). The tradeoff was worthwhile because the engineering requirements above — transactions, single-file, embedded — were non-negotiable for the single-binary product form.
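The migration mechanism described above is small enough to sketch. A minimal illustration of version-gated forward migrations, using a stand-in `Db` type rather than rusqlite and the real `apply_migrations()`:

```rust
/// Stand-in for an open database connection: `user_version` plays the role
/// of PRAGMA user_version, `applied` records SQL that would be executed.
struct Db {
    user_version: u32,
    applied: Vec<&'static str>,
}

// Each entry upgrades the schema by exactly one version.
const MIGRATIONS: &[&str] = &[
    "CREATE TABLE drawers (id TEXT PRIMARY KEY, content TEXT NOT NULL)",
    "ALTER TABLE drawers ADD COLUMN deleted_at TEXT", // soft-delete support
];

fn apply_migrations(db: &mut Db) {
    // Apply only the migrations this database has not seen yet, in order.
    for (i, &sql) in MIGRATIONS.iter().enumerate() {
        let target = (i + 1) as u32;
        if db.user_version < target {
            db.applied.push(sql); // in real code: execute inside a transaction
            db.user_version = target;
        }
    }
}

fn main() {
    // A database created at schema version 1 picks up only the new migration.
    let mut db = Db { user_version: 1, applied: Vec::new() };
    apply_migrations(&mut db);
    assert_eq!(db.user_version, 2);
    assert_eq!(db.applied, vec!["ALTER TABLE drawers ADD COLUMN deleted_at TEXT"]);
}
```

The one-line soft-delete change mentioned above is exactly the second entry: appending a migration and bumping the version.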
Dimension 3: AAAK — From Heuristic to Formal
This is the dimension where the gap between MemPalace's design intent and its implementation was widest.
What MemPalace Does
dialect.py implements an AAAK encoder. It takes conversation text, selects key sentences, extracts top-frequency topics, detects entities (three uppercase letters), assigns emotion codes and semantic flags, and concatenates everything with pipe delimiters. The output looks like AAAK. But as Appendix C documented, there is no formal grammar defining valid AAAK, no decoder to reconstruct text from AAAK, and no round-trip test to verify information preservation.
What mempal Does
The mempal-aaak crate (crates/mempal-aaak/) implements four components that dialect.py was missing:
A BNF grammar. The design document (docs/specs/2026-04-08-mempal-design.md:209-229) defines AAAK syntax formally:
document ::= header NEWLINE body
header ::= "V" version "|" wing "|" room "|" date "|" source
zettel ::= zid ":" entities "|" topics "|" quote "|" weight "|" emotions "|" flags
tunnel ::= "T:" zid "<->" zid "|" label
arc ::= "ARC:" emotion ("->" emotion)*
A conforming parser. parse.rs validates documents against this grammar — checking that entity codes are exactly 3 uppercase ASCII characters, emotion codes are 3-7 lowercase characters, and tunnel references point to existing zettels. This means "valid AAAK" has a mechanical definition, not just a visual resemblance.
An encoder and decoder pair. codec.rs provides AaakCodec::encode() (producing an AaakDocument from raw text) and decode() (reconstructing readable text from AAAK by expanding entity codes back to full names using a bidirectional hash map). MemPalace had only the encode direction.
Round-trip verification. verify_roundtrip() (codec.rs:281-302) encodes text to AAAK, decodes it back, and calculates a coverage metric: preserved / (preserved + lost). Tests in aaak_test.rs verify that round-trip coverage meets a threshold (≥80%) and that any lost assertions are explicitly reported. This is the most important addition — it makes AAAK's information preservation empirically measurable rather than assumed.
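The coverage metric is simple to state precisely. A minimal sketch with illustrative names rather than the actual codec.rs API, treating both sides as sets of extracted assertions:

```rust
use std::collections::HashSet;

/// Illustrative round-trip coverage: preserved / (preserved + lost),
/// computed over assertions present before encoding and after decoding.
fn roundtrip_coverage(original: &[&str], decoded: &[&str]) -> f64 {
    let before: HashSet<&str> = original.iter().copied().collect();
    let after: HashSet<&str> = decoded.iter().copied().collect();
    let preserved = before.intersection(&after).count() as f64;
    let lost = before.difference(&after).count() as f64;
    if preserved + lost == 0.0 {
        1.0 // nothing to preserve counts as full coverage
    } else {
        preserved / (preserved + lost)
    }
}

fn main() {
    let original = ["uses sqlite", "migration added", "ci omits rustfmt", "aaak has a decoder"];
    let decoded = ["uses sqlite", "migration added", "ci omits rustfmt"];
    let cov = roundtrip_coverage(&original, &decoded);
    assert!((cov - 0.75).abs() < 1e-9); // 3 of 4 assertions survived
    assert!(cov < 0.80); // below the threshold: the test fails and reports the loss
}
```

The point of the metric is that lost assertions become a number a test can gate on, rather than an unexamined assumption.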
Chinese: From Bigrams to Part-of-Speech Tagging
MemPalace's dialect.py handles Chinese text by generating CJK character bigrams — a character-level approach that can fragment meaningful words. The phrase "知识图谱" (knowledge graph) would become the bigrams "知识", "识图", "图谱" — the middle one, "识图", is a meaningless fragment, and the compound meaning of the full term is lost.
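The fragmentation is mechanical and easy to reproduce. A minimal bigram generator, shown only to illustrate the character-level approach (this is not dialect.py's code):

```rust
/// Generate character bigrams from a string, the way a CJK bigram
/// tokenizer fragments text.
fn cjk_bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(2)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

fn main() {
    // "知识图谱" (knowledge graph) fragments into three overlapping bigrams;
    // the middle one, 识图, is not a word.
    assert_eq!(cjk_bigrams("知识图谱"), vec!["知识", "识图", "图谱"]);
}
```

A word-level segmenter such as jieba would instead emit 知识 and 图谱 as whole words, which is why mempal switched approaches.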
mempal uses jieba-rs (a Rust port of the jieba Chinese word segmenter) with part-of-speech tagging. codec.rs:579-609 calls jieba's POS tagger to identify proper nouns (nr, ns, nt, nz tags) for entity extraction. codec.rs:611-643 extracts content words (n*, v*, a* tags) for topic extraction, while filtering function words like pronouns and particles. The difference is structural: bigrams are character-level heuristics; POS tagging is word-level linguistic analysis.
Tests cover Chinese encoding explicitly: test_aaak_encode_chinese_text (line 326), test_aaak_encode_mixed_script_text_extracts_cjk_and_ascii_entities (line 377), and test_aaak_roundtrip_does_not_split_on_chinese_commas (line 421).
Dimension 4: Temporal Knowledge Graph — Schema Reserved, Logic Deferred
What MemPalace Does
Chapters 11-13 analyzed MemPalace's temporal knowledge graph: triples with valid_from and valid_to timestamps, contradiction detection across time, and timeline narrative generation. These are among the most intellectually ambitious features in the design.
What mempal Does
mempal preserves the schema. The triples table exists in db.rs:32-41:
CREATE TABLE triples (
id TEXT PRIMARY KEY,
subject TEXT NOT NULL,
predicate TEXT NOT NULL,
object TEXT NOT NULL,
valid_from TEXT,
valid_to TEXT,
confidence REAL DEFAULT 1.0,
source_drawer TEXT REFERENCES drawers(id)
);
The Triple struct in types.rs:23-32 mirrors this schema with valid_from: Option<String> and valid_to: Option<String>. But mempal does not implement automatic triple extraction from conversations, contradiction detection, or timeline narrative generation.
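Although the temporal logic is deferred, the reserved schema already supports a point-in-time validity check. A sketch against the `Triple` shape above; `is_valid_at` is an illustrative helper, not part of the current types.rs:

```rust
/// Mirrors the reserved triples schema. ISO-8601 date strings compare
/// correctly with plain lexicographic ordering, so Option<String> suffices.
struct Triple {
    subject: String,
    predicate: String,
    object: String,
    valid_from: Option<String>,
    valid_to: Option<String>,
}

impl Triple {
    /// A fact holds at `date` if it started on or before that date and
    /// has not yet been invalidated (half-open interval [from, to)).
    fn is_valid_at(&self, date: &str) -> bool {
        let after_start = self.valid_from.as_deref().map_or(true, |f| f <= date);
        let before_end = self.valid_to.as_deref().map_or(true, |t| date < t);
        after_start && before_end
    }
}

fn main() {
    let t = Triple {
        subject: "Kai".into(),
        predicate: "works_on".into(),
        object: "Orion".into(),
        valid_from: Some("2026-03-01".into()),
        valid_to: None, // still current
    };
    assert!(!t.is_valid_at("2026-02-15")); // before the fact became true
    assert!(t.is_valid_at("2026-06-01"));
}
```

When the temporal reasoner lands, contradiction detection reduces to finding two triples with the same subject and predicate whose validity intervals overlap but whose objects differ.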
Why the Deferral
This is a prioritization judgment, not a design disagreement.
The temporal KG features depend on reliable knowledge graph population. In MemPalace, kg_add is a manual MCP tool — the AI explicitly writes triples. Automatic extraction from conversation text (identifying "Kai switched to the Orion project in March") requires either an LLM call per ingestion or a sophisticated NLP pipeline. Both add external dependencies that conflict with the zero-dependency local-first philosophy.
mempal's v1 priority was making the core pipeline reliable: ingest → embed → search → cite. Every feature in that pipeline had to work correctly before adding temporal reasoning on top. The schema reservation means that when the temporal KG is implemented, existing databases are ready — no migration needed for the triples table itself. But the implementation was honestly not ready for v1, and shipping an unreliable temporal reasoner would have been worse than shipping none.
Appendix D's observation applies here: it is better to ship what works than to narrate what is planned.
Dimension 5: MCP Tool Surface — 19 to 7
What MemPalace Does
Chapter 19 documented 19 MCP tools in 5 cognitive groups. The design is intellectually coherent — each group maps to a role the AI plays when interacting with memory.
What mempal Does
mempal exposes 7 tools: mempal_status, mempal_search, mempal_ingest, mempal_delete, mempal_taxonomy, mempal_kg, and mempal_tunnels. These map to the operations that are production-ready and actively used in daily development.
Why the Reduction
The reduction is not a judgment that 19 tools are wrong. It reflects a different stage of implementation maturity and a design choice about self-documentation.
Eight of MemPalace's 19 tools belong to the Knowledge Graph (5) and Navigation (3) groups. mempal keeps only their reliable core — manual triple CRUD via mempal_kg and cross-Wing discovery via mempal_tunnels — because the rest (timeline narration, graph statistics, traversal) depends on a fully populated knowledge graph and a working graph traversal engine, subsystems that Appendix D flagged as more narrated than exercised. Including tools for subsystems that are not yet reliable misleads agents into calling them and getting poor results.
The 7-tool surface also enables richer per-tool documentation. Each tool in mempal carries detailed field-level documentation — the wing field on SearchRequest (crates/mempal-mcp/src/tools.rs:11-16) explains exactly when to omit it and warns that guessing silently returns zero results. With 19 tools, this level of documentation per field would overwhelm the tool-list response. With 7, each tool can be thoroughly self-documenting.
The protocol-level design that makes 7 tools sufficient — the MEMORY_PROTOCOL embedded in MCP server instructions — is the subject of Chapter 28.
The Design Decision Table
mempal's design document (docs/specs/2026-04-08-mempal-design.md) captures these decisions in two tables. Here is the consolidated view:
| Dimension | MemPalace | mempal | Rationale |
|---|---|---|---|
| Spatial structure | Wing/Hall/Room/Closet/Drawer | Wing/Room + editable taxonomy | 33.9pp gain from Wing+Room; Hall/Closet underutilized (App D) |
| Storage | ChromaDB | SQLite + sqlite-vec | Transactions, single-file, embedded deployment |
| Embedding | ChromaDB built-in | ONNX (MiniLM) via Embedder trait | Offline-first, swappable models |
| AAAK | Heuristic encoder only | BNF grammar + encoder + decoder + round-trip | Fix Appendix C deficiencies |
| CJK processing | Character bigrams | jieba POS tagging | Word-level vs character-level |
| Search | Vector-only (ChromaDB) | Hybrid: BM25 (FTS5) + vector + RRF fusion | Exact keyword matching (error codes, function names) that pure vector search misses; inspired by qmd's hybrid approach |
| Temporal KG | Triples + contradiction + timeline | Triples activated (manual CRUD), contradiction/timeline deferred | Relationship queries vector search cannot answer; v1 manual-only to avoid LLM dependency |
| Tunnels | Auto cross-Wing links | Dynamic SQL discovery | Same concept as MemPalace Ch 6, zero-storage-cost implementation via GROUP BY room HAVING COUNT(DISTINCT wing) > 1 |
| MCP tools | 19 in 5 groups | 7 tools + self-describing protocol | Ship what works; mempal_kg and mempal_tunnels added for KG and cross-Wing discovery |
| Language | Python | Rust | Single binary, zero runtime deps (Ch 26) |
Every row in this table traces back to a finding in the book's first 25 chapters or appendices. The "rationale" column is not post-hoc justification — it is the analysis that preceded the implementation.
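The hybrid-search row deserves one concrete illustration. Reciprocal Rank Fusion with k = 60 scores each document by summing 1 / (k + rank) across the BM25 and vector rankings; the sketch below shows the standard RRF formula with hypothetical drawer IDs, not mempal's actual search code:

```rust
use std::collections::HashMap;

/// Standard Reciprocal Rank Fusion: each ranked list contributes
/// 1 / (k + rank) per document, summed, then sorted descending.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for list in lists {
        for (rank, id) in list.iter().enumerate() {
            // rank is 0-based here, so rank + 1 is the 1-based position.
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut out: Vec<(String, f64)> =
        scores.into_iter().map(|(id, s)| (id.to_string(), s)).collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}

fn main() {
    let bm25 = vec!["d_auth", "d_ci", "d_sqlite"]; // keyword ranking
    let vector = vec!["d_sqlite", "d_auth", "d_aaak"]; // semantic ranking
    let fused = rrf_fuse(&[bm25, vector], 60.0);
    // d_auth appears at ranks 1 and 2, scoring 1/61 + 1/62: the best total.
    assert_eq!(fused[0].0, "d_auth");
}
```

RRF's appeal is that it needs no score normalization between BM25 and cosine similarity; only ranks matter, which is why it is a common fusion choice for hybrid search.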
mempal Architecture Overview
The complete system, as it stands after all dimensions of change:
graph TD
subgraph "Interfaces"
CLI["CLI<br/>init/ingest/search/kg/tunnels"]
MCP["MCP Server<br/>7 tools + MEMORY_PROTOCOL"]
REST["REST API<br/>(feature-gated)"]
end
subgraph "Search Layer"
BM25["BM25<br/>FTS5"]
VEC["Vector<br/>sqlite-vec"]
RRF["RRF Fusion<br/>k=60"]
ROUTE["Taxonomy<br/>Router"]
TUNNEL["Tunnel<br/>Hints"]
BM25 --> RRF
VEC --> RRF
ROUTE --> RRF
RRF --> TUNNEL
end
subgraph "Storage (palace.db)"
DRAW["drawers<br/>raw text + wing/room<br/>+ importance + deleted_at"]
FTS["drawers_fts<br/>FTS5 index"]
DVEC["drawer_vectors<br/>embeddings (dynamic dim)"]
TRIP["triples<br/>S-P-O + temporal"]
TAX["taxonomy<br/>routing keywords"]
end
subgraph "Embedding"
M2V["model2vec<br/>(default, 256d)"]
ORT["ONNX MiniLM<br/>(optional, 384d)"]
API["External API<br/>(configurable)"]
end
subgraph "Output"
AAAK["AAAK Codec<br/>BNF + jieba"]
CITE["Citations<br/>drawer_id + source_file"]
end
CLI --> DRAW
MCP --> DRAW
REST --> DRAW
DRAW --> FTS
DRAW --> DVEC
M2V --> DVEC
ORT -.-> DVEC
API -.-> DVEC
DRAW --> AAAK
RRF --> CITE
This diagram shows the full data flow: content enters through any interface (CLI, MCP, REST), is stored in SQLite with raw text, FTS5 index, and vector embeddings. Search combines BM25 and vector paths through RRF fusion, filtered by taxonomy routing, with tunnel hints attached. Output can be raw or AAAK-compressed, always with citations.
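The tunnel hints in the search layer rest on the grouping query named in the design table (GROUP BY room HAVING COUNT(DISTINCT wing) > 1). The same logic, sketched over an in-memory (wing, room) list instead of SQL:

```rust
use std::collections::{HashMap, HashSet};

/// A room is a tunnel candidate when it appears in more than one wing.
/// In-memory equivalent of: SELECT room FROM drawers
///   GROUP BY room HAVING COUNT(DISTINCT wing) > 1
fn find_tunnels(drawers: &[(&str, &str)]) -> Vec<String> {
    let mut wings_per_room: HashMap<&str, HashSet<&str>> = HashMap::new();
    for &(wing, room) in drawers {
        wings_per_room.entry(room).or_default().insert(wing);
    }
    let mut tunnels: Vec<String> = wings_per_room
        .into_iter()
        .filter(|(_, wings)| wings.len() > 1) // room shared across wings
        .map(|(room, _)| room.to_string())
        .collect();
    tunnels.sort();
    tunnels
}

fn main() {
    let drawers = [
        ("mempal", "auth"),
        ("mempal", "storage"),
        ("sideproj", "auth"), // "auth" spans two wings, so it is a tunnel
    ];
    assert_eq!(find_tunnels(&drawers), vec!["auth"]);
}
```

This is why the design table calls tunnels a zero-storage-cost feature: nothing is persisted, the links are discovered from the wing/room columns at query time.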
What This Comparison Reveals
The pattern across all five dimensions is consistent: mempal preserves MemPalace's design ideas while simplifying or completing their implementation.
- Spatial structure: same idea (semantic partitioning improves retrieval), simpler implementation (two tiers instead of five)
- Storage: same idea (local single-file), different engine (SQLite instead of ChromaDB)
- AAAK: same idea (AI-readable compression), complete implementation (grammar + decoder + round-trip)
- Temporal KG: same idea (facts expire), honest deferral (schema ready, logic not)
- MCP tools: same idea (structured AI interface), focused surface (7 production-ready tools)
This is not a coincidence. It is the natural outcome of building from a detailed analysis rather than starting from scratch. The analysis told us what to keep and what to change — the implementation simply followed.
Chapter 28: Self-Describing Protocol
Positioning: This chapter examines mempal's most distinctive design: embedding behavioral instructions directly into the tool interface so that any AI agent learns how to use mempal correctly from the tool itself — no external documentation, no system prompt configuration. Prerequisite: Chapter 27 (what changed architecturally). Applicable scenario: when designing tools that AI agents will discover and use without human guidance.
The Problem: Tools That Cannot Teach
When an AI agent connects to an MCP server, it receives a list of tools. Each tool has a name, a description, and an input schema. The agent must decide — from this information alone — when to call which tool, what parameters to pass, and how to interpret results.
Chapter 19 analyzed MemPalace's 19-tool MCP surface. The tool descriptions tell the agent what each tool does. But they do not tell the agent when to use memory versus grepping files, how to discover valid wing names before filtering, or why citations matter. Those behavioral patterns were documented in README files and project guides — places that an MCP-connected agent never sees.
mempal's answer is not better tool descriptions. It is a behavioral protocol embedded in the tool interface itself.
Protocol as Code
The MEMORY_PROTOCOL lives in crates/mempal-core/src/protocol.rs as a Rust string constant:
```rust
pub const MEMORY_PROTOCOL: &str = r#"MEMPAL MEMORY PROTOCOL (for AI agents)

You have persistent project memory via mempal. Follow these rules...
"#;
```
This constant is compiled into the mempal binary. It reaches AI agents through two paths:
graph LR
P["protocol.rs<br/>MEMORY_PROTOCOL const"] --> S["server.rs<br/>ServerInfo::with_instructions()"]
P --> ST["mempal_status tool<br/>memory_protocol field"]
S --> C1["Agent connects via MCP<br/>→ reads initialize response<br/>→ learns rules immediately"]
ST --> C2["Agent calls mempal_status<br/>→ reads protocol in response<br/>→ fallback for clients that<br/>ignore initialize.instructions"]
The primary path is ServerInfo::with_instructions() in crates/mempal-mcp/src/server.rs. The MCP specification defines an instructions field on the server info response — most MCP clients inject this into the LLM's system prompt at connection time. By putting the protocol there, mempal teaches every connected agent its behavioral rules before the agent makes its first tool call.
The fallback path is the mempal_status tool, which returns the same protocol text in its memory_protocol response field. This covers clients that ignore the initialize.instructions field — the agent can still discover the protocol by calling status.
The protocol text lives next to the code, not in a separate documentation file. The module-level doc comment on protocol.rs explains why:
```rust
//! This is embedded in MCP status responses and CLI wake-up output,
//! following the same self-describing principle as `mempal-aaak::generate_spec()`:
//! the protocol lives next to the code so it cannot drift.
```
If the protocol said "call mempal_status to discover wings" but mempal_status stopped returning wing data, the protocol would be wrong. By keeping them in the same codebase — and testing that MEMORY_PROTOCOL contains expected keywords in mcp_test.rs — the text and behavior stay synchronized.
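A keyword-guard test of this kind is easy to picture. A sketch with a stand-in constant; the real MEMORY_PROTOCOL lives in protocol.rs and is considerably longer:

```rust
/// Stand-in protocol text for illustration only. The guard below asserts
/// that the protocol still names the tools and fields it instructs
/// agents to use, so a rename cannot silently strand the instructions.
const MEMORY_PROTOCOL: &str = "MEMPAL MEMORY PROTOCOL (for AI agents)\n\
    0. FIRST-TIME SETUP: call mempal_status() once per session.\n\
    5. CITE EVERYTHING: reference drawer_id and source_file.";

fn main() {
    // If a refactor renamed mempal_status or dropped citations, these fail.
    for keyword in ["mempal_status", "drawer_id", "source_file"] {
        assert!(MEMORY_PROTOCOL.contains(keyword), "protocol lost {keyword}");
    }
}
```

The guard is crude on purpose: it does not parse the protocol, it only pins the load-bearing identifiers so text and behavior cannot drift apart unnoticed.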
Seven Rules, Seven Failures
The MEMORY_PROTOCOL contains seven rules (numbered 0 through 5, plus 3a). Each rule exists because a real failure happened during mempal's development. This is not theoretical API design — it is post-incident documentation encoded as behavioral instructions.
Rule 0: FIRST-TIME SETUP
Call mempal_status() once at the start of any session to discover available wings and their drawer counts.
The failure: In a fresh Codex session, a user asked about AAAK's Chinese word segmentation implementation. Codex correctly called mempal_search — but passed {"wing": "engineering"}. mempal's wing filter is strict equality. The only wing in the database was "mempal". The query returned zero results, and Codex fell back to reading source code directly, bypassing memory entirely.
Root cause (documented in drawer_mempal_mempal_mcp_a916f9dc): Three things converged. SearchRequest.wing had no doc comment, so the JSON schema gave no guidance on when to omit it. The field name "wing" invited guessing. And nothing told fresh clients to call mempal_status first to discover valid wing names.
The fix: Rule 0 tells agents to call mempal_status() once per session before using wing filters. The status response includes a scopes array listing every (wing, room, drawer_count) triple. After reading this, the agent knows the exact wing names and can filter correctly — or leave the filter unset for a global search.
Rule 1: WAKE UP
Some clients pre-load recent wing/room context. Others do NOT — for those, step 0 is how you wake up.
The failure: The original Rule 1 assumed all clients pre-loaded context via session-start hooks. This was true for Claude Code (which has SessionStart hooks) but false for Codex, Cursor, and raw MCP clients. The assumption caused Rule 0 to not exist initially — it was added after the Codex wing-guessing incident.
The fix: Rule 1 now explicitly distinguishes between clients with pre-load mechanisms and those without, directing the latter to Rule 0.
Rule 2: VERIFY BEFORE ASSERTING
Before stating project facts, call mempal_search to confirm. Never guess from general knowledge.
The failure (documented in drawer_mempal_default_cb58c7f3): Claude was observed making project-specific claims without consulting memory. In one instance, Claude stated "mempal_search cannot retrieve by drawer_id, we need a new mempal_get_drawer tool." In reality, Claude had used a direct sqlite3 shell command as a side-door, then framed the limitation as an MCP gap. The actual MCP search works correctly when given a semantic query instead of an opaque ID.
The fix: Rule 2 requires agents to call mempal_search before asserting project facts. This prevents an agent from hallucinating project state or proposing unnecessary tool additions based on incorrect assumptions.
Rule 3: QUERY WHEN UNCERTAIN
When the user asks about past decisions or historical context, call mempal_search. Do not rely on conversation memory alone.
The failure: Across multiple sessions, agents would answer "why did we choose X?" questions from their training data's general knowledge rather than from the project's actual decision history. A user asking "why did we switch from ChromaDB to SQLite?" would get a generic answer about SQLite advantages, not the specific engineering rationale documented in mempal drawers.
The fix: Rule 3 explicitly triggers on patterns like "why did we...", "last time we...", and "what was the decision about..." — phrases that signal project-specific historical questions.
Rule 3a: TRANSLATE QUERIES TO ENGLISH
The embedding model (MiniLM) is English-centric. Non-English queries produce poor vector representations.
The failure: During dogfooding in this session, a Chinese query — "它不再是一个高级原型" (it is no longer just an advanced prototype) — returned completely irrelevant results (AAAK documentation instead of the target status snapshot). The same query translated to English — "no longer just an advanced prototype" — hit the correct drawer immediately.
Root cause: MiniLM-L6-v2 has sparse CJK token coverage. Chinese text fragments into unknown-token embeddings with low semantic fidelity. The vector representation of the Chinese query was so poor that it matched by accident rather than by meaning.
The fix: Rule 3a tells agents to translate non-English queries into English before passing them to mempal_search. This is a zero-cost fix — the agents performing the search are LLMs that can translate natively. The rule includes a concrete example to make the expected behavior unambiguous.
Rule 4: SAVE AFTER DECISIONS
When a decision is reached in conversation, call mempal_ingest to persist it. Include the rationale, not just the decision.
The failure: In an early session, Claude completed a significant implementation (adding CI workflows) and immediately asked "want to commit?" — without saving a decision record to mempal. Codex, picking up the next session, had no record of why the CI was structured that way, what was deliberately omitted (rustfmt), or what the follow-up priorities were. The handoff relied entirely on git commit messages, which capture what changed but not why.
The fix: Rule 4 makes decision persistence explicit. "Include the rationale, not just the decision" is the key phrase — a drawer that says "added CI" is nearly useless; a drawer that says "added CI with default + all-features matrix, deliberately omitted rustfmt because formatting drift exists, follow-up: cargo fmt --all then add fmt check" is the kind of context that enables cross-session continuity.
Rule 5: CITE EVERYTHING
Every mempal_search result includes drawer_id and source_file. Reference them when you answer.
The failure: Without this rule, agents would search mempal, find relevant information, and then present it as their own knowledge — "we decided to use SQLite for single-file portability" — without attribution. The user has no way to verify the claim, trace it to its source, or assess its age.
The fix: Rule 5 requires explicit citations: "according to drawer_mempal_default_2fd6f980, we decided...". Citations serve three purposes: they let the user verify the source, they make the agent's reasoning auditable, and they distinguish memory-backed claims from hallucination.
Field-Level Documentation: Teaching Through Schema
The protocol teaches behavioral rules. But there is a second layer of self-documentation: field-level doc comments on tool input types that propagate into the MCP schema.
SearchRequest in crates/mempal-mcp/src/tools.rs demonstrates this pattern:
```rust
#[derive(Debug, Clone, Deserialize, JsonSchema)]
pub struct SearchRequest {
    /// Natural-language query. Use the user's actual question verbatim
    /// when possible — the embedding model handles paraphrase and translation.
    pub query: String,

    /// Optional wing filter. OMIT (leave null) unless you already know the
    /// EXACT wing name from a prior mempal_status call or the user named it
    /// explicitly. Wing filtering is a strict equality match, so guessing a
    /// wing name (e.g. "engineering", "backend") will silently return zero
    /// results. When in doubt, leave this field unset for a global search
    /// across all wings.
    pub wing: Option<String>,

    // ...
}
```
The #[derive(JsonSchema)] macro from schemars converts these doc comments into JSON Schema description fields. When an MCP client calls tools/list, it receives the tool's input schema — including these descriptions. The agent reads "guessing a wing name will silently return zero results" directly from the tool definition, before it ever considers calling the tool.
This is the propagation chain:
Rust doc comment → schemars derive → JSON Schema description → MCP tools/list response → agent's tool-selection context
The chain means that improving a doc comment in Rust source code automatically improves the guidance every agent receives. No documentation site to update, no system prompt to modify, no client configuration to change. The guidance travels with the tool definition.
A test in mcp_test.rs guards this: test_mempal_search_schema_warns_about_wing_guessing lists tools, finds mempal_search, and asserts the serialized input_schema contains both "OMIT" and "global search". This prevents a future refactor from silently stripping the guidance.
19 Tools to 7: Less Is More (With Context)
Chapter 19 documented MemPalace's 19 tools organized into 5 cognitive roles. mempal has 7 tools. This section explains why the reduction works — and what it depends on.
What 7 Tools Cover
| Tool | Role | Replaces from MemPalace |
|---|---|---|
| `mempal_status` | Observe | `status`, `list_wings`, `list_rooms`, `get_aaak_spec` |
| `mempal_search` | Retrieve | `search`, `check_duplicate` (hybrid: BM25 + vector + RRF) |
| `mempal_ingest` | Write | `add_drawer` (supports `dry_run` preview) |
| `mempal_delete` | Write | `delete_drawer` (soft-delete with audit) |
| `mempal_taxonomy` | Configure | `get_taxonomy` (read) + taxonomy edit (new) |
| `mempal_kg` | Knowledge | `kg_add`, `kg_query`, `kg_invalidate` (manual triples CRUD) |
| `mempal_tunnels` | Navigate | `find_tunnels` (dynamic cross-Wing room discovery) |
What Is Still Missing
Six tools from MemPalace's surface have no mempal equivalent:
- KG timeline and stats (2 tools: `kg_timeline`, `kg_stats`): mempal's `mempal_kg` covers add/query/invalidate but not timeline narrative generation or graph statistics. These depend on a more populated knowledge graph.
- Navigation group (2 tools: `traverse`, `graph_stats`): Graph traversal requires richer inter-drawer edges beyond the current tunnel mechanism.
- Diary group (2 tools: `diary_write`, `diary_read`): Agent specialist diaries are not yet implemented.
These are not rejected — they are deferred until the subsystems they depend on are production-ready.
Why 7 Works
The 7-tool surface works because of two design decisions that MemPalace did not have:
1. The protocol compensates for missing tools. MemPalace needed list_wings and list_rooms as separate tools because there was no mechanism to tell agents when to use them. mempal's mempal_status returns wing/room data and the protocol tells agents (Rule 0) to call it at session start. One tool replaces three because the behavioral context is embedded.
2. Self-documenting fields reduce per-call confusion. MemPalace needed check_duplicate as a separate tool because agents had no way to know whether a drawer already existed before writing. mempal's mempal_ingest handles deduplication internally — drawer_exists() is called before insertion. The agent does not need to check separately.
3. Consolidated tools do more per call. MemPalace split KG into 5 separate tools (kg_query, kg_add, kg_invalidate, kg_timeline, kg_stats). mempal's mempal_kg handles add/query/invalidate as actions within a single tool — the agent passes {"action": "query", "subject": "Kai"} instead of choosing between 5 tools. Fewer tools, same capability surface.
The lesson is not "fewer tools are always better." It is that tools and protocol are complementary surfaces. When the protocol carries behavioral guidance and tools consolidate related actions, each tool can do more with less cognitive overhead on the agent's side.
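The consolidation pattern from point 3 can be sketched as a single dispatch. Types and strings here are illustrative, not mempal-mcp's actual request structs:

```rust
/// One tool, one `action` discriminator, three operations, instead of
/// three (or five) separate MCP tools.
enum KgAction {
    Add,
    Query,
    Invalidate,
}

fn parse_action(s: &str) -> Option<KgAction> {
    match s {
        "add" => Some(KgAction::Add),
        "query" => Some(KgAction::Query),
        "invalidate" => Some(KgAction::Invalidate),
        _ => None, // unknown actions are rejected, not guessed at
    }
}

fn dispatch(action: &str, subject: &str) -> String {
    match parse_action(action) {
        Some(KgAction::Query) => format!("SELECT * FROM triples WHERE subject = '{subject}'"),
        Some(KgAction::Add) => format!("INSERT triple for '{subject}'"),
        Some(KgAction::Invalidate) => format!("SET valid_to = now() for '{subject}'"),
        None => format!("error: unknown action '{action}'"),
    }
}

fn main() {
    // One call shape, {"action": "query", "subject": "Kai"}, replaces
    // choosing between several single-purpose tools.
    assert!(dispatch("query", "Kai").contains("subject = 'Kai'"));
    assert!(dispatch("timeline", "Kai").starts_with("error"));
}
```

The agent's tool-selection problem shrinks (one KG tool to consider), while the validation burden moves into the tool, where unknown actions get an explicit error rather than a silent miss.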
The Self-Description Principle
The design pattern behind MEMORY_PROTOCOL is not unique to mempal. It extends to other parts of the system:
- AAAK spec generation: mempal-aaak's generate_spec() function produces the AAAK format specification dynamically from the codec's own constants (emotion codes, flag names, delimiter rules). The spec is always consistent with the encoder because it is generated from the same source.
- CLI wake-up: mempal wake-up outputs the same MEMORY_PROTOCOL text to stdout. An AI agent that reads CLI output (rather than connecting via MCP) still learns the behavioral rules.
- Status response: mempal_status returns not just data but the protocol text and the AAAK spec. A single tool call gives the agent everything it needs to operate correctly.
The common principle: the tool teaches the agent how to use it, from the tool itself. No external documentation required, no system prompt configuration assumed, no version drift between documentation and implementation.
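The spec-from-constants idea can be shown with a minimal sketch. The constant names and values below are invented stand-ins for AAAK's real tables; the point is only that the spec text and the encoder share one source of truth, so they cannot drift.

```python
# Illustrative sketch of generate_spec(): the human/agent-readable spec is
# rendered from the same constants the encoder uses. Codes here are invented.
EMOTION_CODES = {"J": "joy", "F": "frustration", "N": "neutral"}
DELIMITER = "|"

def encode(emotion: str, text: str) -> str:
    # Encoder: look up the code for an emotion name and join with the delimiter.
    code = {name: c for c, name in EMOTION_CODES.items()}[emotion]
    return f"{code}{DELIMITER}{text}"

def generate_spec() -> str:
    # Spec generator: reads the *same* constants, so it always matches encode().
    lines = [f"Fields are separated by '{DELIMITER}'.", "Emotion codes:"]
    lines += [f"  {code} = {name}" for code, name in sorted(EMOTION_CODES.items())]
    return "\n".join(lines)
```

Changing `EMOTION_CODES` changes both the encoder's behavior and the generated spec in one edit.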
This principle has a limitation: it only works when the consumers are AI agents that can read and follow natural-language instructions. If mempal gained human users who interact through a GUI, the self-describing protocol would not help them. The design is deliberately optimized for AI consumers — which is the only audience mempal targets.
Agent Session Lifecycle
Putting all the pieces together — protocol, tools, and search — here is how a typical agent session flows:
sequenceDiagram
participant A as AI Agent
participant M as mempal MCP
participant DB as palace.db
Note over A,M: Connection (automatic)
M->>A: initialize.instructions = MEMORY_PROTOCOL
Note over A: Agent learns 9 rules
Note over A,M: Rule 0: First-Time Setup
A->>M: mempal_status()
M->>DB: scopes, counts
DB-->>M: wings=[mempal], drawers=120
M-->>A: status + protocol + AAAK spec
Note over A,M: Rule 3: Query When Uncertain
A->>M: mempal_search("auth decision")
M->>DB: BM25(FTS5) + Vector(sqlite-vec)
DB-->>M: RRF merged results + tunnel hints
M-->>A: [{drawer_id, content, source_file, tunnel_hints}]
Note over A,M: Rule 4: Save After Decision
A->>M: mempal_ingest({content, wing, importance:4})
M->>DB: dedup check → insert drawer + vector + FTS
DB-->>M: drawer_id (+ duplicate_warning if similar)
M-->>A: {drawer_id}
Note over A,M: Rule 5a: Keep a Diary (optional)
A->>M: mempal_ingest({wing:"agent-diary", room:"claude"})
M->>DB: insert behavioral observation
M-->>A: {drawer_id}
This lifecycle is entirely self-bootstrapping. The agent receives the protocol at connection time, discovers wings via mempal_status, searches with hybrid retrieval, saves decisions with importance ranking, and optionally records behavioral observations in a diary. No external configuration required.
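The lifecycle above can be simulated against a toy in-memory palace. Everything here is illustrative — real mempal is Rust over SQLite — but the shape of the three calls (status with embedded protocol, ingest with internal dedup, search) follows the diagram.

```python
# Toy sketch of the self-bootstrapping session: status carries the protocol,
# ingest dedups internally, search stands in for hybrid retrieval.
PROTOCOL = ("Rule 0: call status first. Rule 3: search when uncertain. "
            "Rule 4: save after decisions.")

class ToyPalace:
    def __init__(self):
        self.drawers = []

    def status(self):
        # Rule 0: one call returns data *and* the behavioral protocol.
        wings = sorted({d["wing"] for d in self.drawers})
        return {"wings": wings, "drawer_count": len(self.drawers),
                "protocol": PROTOCOL}

    def ingest(self, content, wing="default", importance=3):
        # Dedup is internal: identical content is not stored twice.
        for d in self.drawers:
            if d["content"] == content:
                return {"drawer_id": d["id"], "duplicate_warning": True}
        drawer = {"id": len(self.drawers), "content": content,
                  "wing": wing, "importance": importance}
        self.drawers.append(drawer)
        return {"drawer_id": drawer["id"]}

    def search(self, query):
        # Stand-in for BM25 + vector + RRF: naive substring match.
        return [d for d in self.drawers if query.lower() in d["content"].lower()]
```

A session then reads as: `status()` → `search("auth decision")` → `ingest(...)`, with no configuration outside the connection itself.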
What This Means for Tool Design
mempal's self-describing protocol is a specific instance of a broader design question: how should tools teach their users?
Traditional tools rely on documentation — man pages, README files, API reference sites. The documentation is written once and maintained separately from the code. It drifts. Users who discover the tool through package managers or tool registries may never find the documentation.
For AI-consumed tools, the tool definition is the documentation. The MCP tools/list response is the only context the agent has. Every piece of guidance that is not in the tool definition — not in the description, not in the field-level schema, not in the initialize.instructions — does not exist for the agent.
mempal's answer is to put behavioral guidance in three places that the agent will definitely see: initialize.instructions (automatic at connection), mempal_status response (automatic at first call), and field-level description on every input type (visible at every tool call). Redundancy is deliberate — different clients read different parts of the interface.
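What field-level guidance looks like in practice can be sketched as an MCP-style tool entry. The structure follows MCP's `tools/list` response shape (name, description, JSON Schema input), but the text and field choices below are invented for illustration, not copied from mempal's actual definitions.

```python
# Illustrative MCP-style tool definition carrying behavioral guidance at the
# field level. Descriptions are invented examples, not mempal's real text.
MEMPAL_INGEST_TOOL = {
    "name": "mempal_ingest",
    "description": "Save a decision or fact. Call after significant decisions (Rule 4).",
    "inputSchema": {
        "type": "object",
        "properties": {
            "content": {
                "type": "string",
                "description": "What to remember. Include rationale, not just the outcome.",
            },
            "importance": {
                "type": "integer",
                "description": "1-5. Use 4+ for decisions future sessions must see.",
            },
        },
        "required": ["content"],
    },
}
```

An agent that sees only this JSON still learns when to call the tool and how to fill each field — no external documentation needed.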
The overhead of this approach is that the protocol text consumes tokens in the agent's context. The MEMORY_PROTOCOL is roughly 500 tokens. For an agent with a 100K+ context window, this is negligible. For a hypothetical agent with a 4K window, it would be expensive. mempal bets on the trajectory of context windows growing, not shrinking.
Chapter 29: Multi-Agent Coordination
Positioning: This chapter documents how mempal became an asynchronous coordination layer between different AI agents — a use case that was discovered during development, not designed upfront. Prerequisite: Chapter 28 (the protocol that enables agent interaction). Applicable scenario: when multiple AI agents work on the same project across separate sessions and need shared context.
An Unplanned Discovery
mempal was designed as a memory tool for coding agents. It was not designed as a coordination mechanism. But during development, something unexpected happened: two AI agents — Claude (via Claude Code) and Codex (via OpenAI Codex CLI) — began using mempal drawers to hand off context to each other across sessions. Neither agent was in the same session. Neither could communicate directly. Yet they coordinated effectively, because both could read and write to the same memory.
This chapter tells that story with specific evidence — drawer IDs, commit hashes, and timestamps. Every claim is traceable to a mempal drawer or a git commit.
The First Relay
The first successful Claude↔Codex relay happened on April 10, 2026, over three sessions.
sequenceDiagram
participant U as User
participant CX as Codex Session 1
participant MP as mempal (palace.db)
participant CC as Claude Session 2
participant CX2 as Codex Session 3
U->>CX: "Clean up the working tree, fix clippy"
CX->>CX: Commits 4 grouped commits + clippy fix
CX->>MP: Writes status snapshot (drawer a295458d)
Note over MP: "95 tests pass, clippy clean,<br/>priority #1: add CI"
U->>CC: "Add CI for the project"
CC->>MP: Reads drawer a295458d
CC->>CC: Writes .github/workflows/ci.yml
CC->>MP: Writes CI decision (drawer b103b147)
Note over MP: "CI added (f094cb0),<br/>deliberately omitted rustfmt"
U->>CX2: "Review what Claude did"
CX2->>MP: Reads drawers a295458d + b103b147
CX2->>CX2: Finds --all-features missing from CI
CX2->>CX2: Commits fix (4fac199)
CX2->>MP: Writes review finding (drawer cb58c7f3)
Note over MP: "Claude skip-repo-docs<br/>anti-pattern documented"
Session 1: Codex Establishes Baseline
The user asked Codex to clean up an untidy working tree. Codex committed four groups of changes, fixed a clippy blocker, and wrote a status snapshot to mempal:
drawer_mempal_default_a295458d: "mempal moved from a dirty advanced prototype to a much safer internal-tool state... Remaining highest-priority gaps: 1. Add minimal CI for test/build/clippy. 2. Fix ingest source_file path-normalization bug."
This drawer contains more than a status update. It contains a prioritized task list — the kind of context that normally lives in a project manager's head or a tracking tool. By writing it to mempal, Codex made its judgment available to any future agent that searches for "what should we do next."
Session 2: Claude Reads and Acts
In a separate session, the user asked Claude to add CI. Claude searched mempal, found drawer_mempal_default_a295458d, and saw "Add minimal CI" as priority #1. Claude wrote the workflow and committed it as f094cb0, then saved its own decision record:
drawer_mempal_default_b103b147: "Added minimal GitHub Actions CI workflow. Commit f094cb0... Deliberate omission: cargo fmt --check is NOT in this first iteration... Important nuance from the Codex/Claude handoff pattern: Codex executed a295458d (commit + clippy fix), Claude executed this (CI). Neither agent did both — explicit division of labor across sessions, with drawer-based handoff as the coordination mechanism."
Claude's drawer explicitly documents the handoff pattern. It notes what was deliberately omitted (rustfmt) and why (formatting drift would block CI). This is the kind of rationale that git commit messages rarely capture.
Session 3: Codex Reviews and Catches a Gap
The user returned to Codex for a review. Codex read both drawers, then inspected the CI workflow Claude had written. It found a gap: the workflow only ran default-feature commands, missing the --all-features flag that the project's own README and docs specified for verification.
Codex committed the fix (4fac199), then wrote its observation to mempal:
drawer_mempal_default_cb58c7f3: "Claude 'skip-repo-docs' anti-pattern in mempal infra work — observed 3 times in one session... Claude's skipped action: grep/search repo docs and prior mempal drawers for canonical conventions."
This is remarkable. Codex did not just fix the bug. It identified a behavioral pattern across Claude's work — three instances of the same mistake in one session — and documented it as a named anti-pattern. The drawer includes root causes, specific instances, and a proposed rule to prevent recurrence.
Decision Memory vs. Git Diff
The relay above demonstrates a fundamental difference between what git captures and what mempal captures. Consider commit f094cb0 — Claude's CI workflow:
What git log says:
f094cb0 ci: add minimal GitHub Actions workflow for test, clippy, build
What git diff shows:
A .github/workflows/ci.yml file with three jobs: clippy, test, and build.
What mempal drawer b103b147 says:
- This completes priority #1 from Codex's status snapshot
- Uses dtolnay/rust-toolchain@stable + Swatinem/rust-cache@v2
- Ubuntu-latest, no system deps needed because ort uses download-binaries and sqlite-vec is vendored
- Verified locally before committing: all three commands pass
- Deliberately omitted cargo fmt --check because formatting drift exists in at least two test files
- Follow-up work: single commit cargo fmt --all + add fmt step to CI
- First successful Claude↔Codex relay via mempal drawers
The git history tells you what changed. The mempal drawer tells you why it changed that way, what was considered and rejected, what remains to do, and how this action connects to the larger project trajectory. The two are complementary — neither replaces the other.
This complementarity is precisely why Rule 4 (SAVE AFTER DECISIONS) exists in the MEMORY_PROTOCOL. Without it, the relay would have broken at Session 2: Claude would have committed the CI workflow but left no context for Codex to understand the deliberate omissions.
Anti-Pattern Discovery
The most unexpected outcome of the Claude↔Codex relay was not the successful handoff — it was the emergence of cross-agent code review.
Codex's drawer_mempal_default_cb58c7f3 documented three instances of the same pattern in Claude's work:
1. Wing guessing: Claude added MEMORY_PROTOCOL and SearchRequest doc comments without first checking what wing names actually existed in the database. The fix was correct, but it could have been informed by existing data.
2. Unnecessary tool proposal: Claude proposed adding a mempal_get_drawer tool after using a direct sqlite3 shell command as a workaround. The existing mempal_search already handled the use case — Claude didn't verify this before proposing.
3. CI missing --all-features: Claude wrote CI based on "how CI usually looks" without reading the project's own README, which specified --all-features in three separate places.
Codex's analysis identified the common thread: "Claude's first action when writing infrastructure: generate from implicit knowledge of 'how this kind of thing usually looks.' Claude's skipped action: grep repo docs and prior mempal drawers for canonical conventions."
To be clear, the pattern discovery was not one-directional. Codex had its own failures during the same period. The wing-guessing incident — passing {"wing": "engineering"} when the only wing was "mempal" — was a Codex mistake, not a Claude one. Codex also triggered a data-loss incident during a cleanup operation, sweeping 69 drawers with an overly broad DELETE WHERE source_file LIKE '...' clause (documented in drawer_mempal_mempal_mcp_a916f9dc). Both agents made mistakes; the difference was that mempal made those mistakes visible and documentable.
This is a code review — but not of code. It is a review of agents' behavioral patterns, conducted by each agent observing the other's work, stored in the same memory system both agents use. The anti-pattern documentation then becomes available in future sessions, creating a feedback loop: one agent observes → documents → the other reads in next session → adjusts behavior.
The feedback loop only works because both agents share the same memory. Without mempal, these observations would have been trapped in conversation transcripts that neither agent sees in subsequent sessions.
Dogfooding: What Worked and What Did Not
mempal uses itself to remember its own development decisions. This section evaluates that dogfooding honestly.
What Worked
Cross-session context transfer. The primary value proposition delivered. When a new session starts, the agent calls mempal_status, searches for recent decisions, and picks up where the previous session left off. The drawer-based handoff replaced what would otherwise be manual briefings ("here's what we did last time...").
Prioritized task handoffs. Codex's status snapshot (drawer_mempal_default_a295458d) included a ranked priority list. Claude read it and executed priority #1. This is more structured than a TODO file because the priorities include rationale and context.
Behavioral pattern tracking. The skip-repo-docs anti-pattern was identified, documented, and made available for future correction — all within the memory system itself. This demonstrates that memory tools can serve not just as information stores but as behavioral feedback mechanisms.
Cross-agent accountability. Every decision is attributed to a session and agent. When the --all-features gap was found, the drawer trail made it clear: Claude wrote the CI, Codex caught the gap. This is not blame — it is traceability.
What Did Not Work
Non-English search degradation. When the user asked a Chinese-language question about mempal's status, the search returned irrelevant results. The MiniLM embedding model's sparse CJK coverage meant that Chinese queries produced vectors with low semantic fidelity. The protocol-level fix (Rule 3a: translate to English) is a workaround, not a solution. Agents that forget to translate — or that connect through clients that do not inject the protocol — will still get poor results for non-English queries.
Data loss from unguarded deletion. During the wing-guessing fix session, a cleanup script with an overly broad DELETE WHERE source_file LIKE '...' clause swept 69 drawers, including at least one narrative decision drawer (drawer_mempal_mempal_mcp_4b55f386) that had no file-backed recovery path. This incident directly motivated the soft-delete safeguards implemented later — mempal delete now marks drawers with deleted_at rather than physically removing them, and mempal purge is the explicit "I'm sure" step.
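The soft-delete safeguard described above can be sketched with SQLite. The schema and column names here are illustrative (mempal's real schema may differ); the mechanism — `delete` marks, `purge` removes — is the one the text describes.

```python
# Sketch of soft delete vs. purge: deletion only stamps deleted_at, so a
# too-broad delete remains recoverable until an explicit purge.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE drawers (id TEXT PRIMARY KEY, content TEXT, deleted_at TEXT)")
con.execute("INSERT INTO drawers VALUES ('a1', 'auth decision', NULL)")
con.execute("INSERT INTO drawers VALUES ('b2', 'old note', NULL)")

def soft_delete(drawer_id: str) -> None:
    # 'mempal delete' analogue: mark the row; content stays recoverable.
    con.execute("UPDATE drawers SET deleted_at = datetime('now') WHERE id = ?",
                (drawer_id,))

def purge() -> int:
    # 'mempal purge' analogue: the explicit "I'm sure" step.
    cur = con.execute("DELETE FROM drawers WHERE deleted_at IS NOT NULL")
    return cur.rowcount

soft_delete("b2")
live = con.execute("SELECT id FROM drawers WHERE deleted_at IS NULL").fetchall()
```

Searches filter on `deleted_at IS NULL`, so a soft-deleted drawer disappears from results while its row survives a mistaken sweep.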
Protocol compliance is voluntary. The MEMORY_PROTOCOL tells agents to save decisions (Rule 4) and cite sources (Rule 5). But agents sometimes forget. In this very session, Claude completed a significant implementation without saving a decision record — until the user pointed out that Codex always does it automatically. The protocol can instruct, but it cannot enforce.
Drawer discovery depends on search quality. If a decision was stored with poor semantic hooks (vague content, no distinctive keywords), future searches may not find it. The memory system is only as good as the content ingested. There is no mechanism to assess or improve drawer quality after the fact.
Agent Diary: Behavioral Learning Across Sessions
The dogfooding experience led to a structured approach for recording behavioral observations: the agent diary. MEMORY_PROTOCOL Rule 5a tells agents to write diary entries to wing="agent-diary" with room set to the agent's name, using standardized prefixes:
- OBSERVATION: factual behavioral patterns ("Claude forgets to save decisions after commits")
- LESSON: actionable takeaways ("FTS5 BM25 was the highest-value improvement")
- PATTERN: recurring cross-session behaviors ("Codex reads docs first, Claude generates first")
The diary is not a separate feature — it uses the same mempal_ingest tool with a wing/room convention. But the structured prefixes make diary entries searchable by type:
mempal search "lesson" --wing agent-diary --room claude
Combined with Claude Code's auto-dream feature (which consolidates session memory between sessions), the diary creates a feedback loop: agents observe their own behavior → record patterns → future sessions read the diary → adjust behavior. This is the closest thing to "learning from experience" that a stateless LLM can achieve — not by changing weights, but by accumulating searchable observations in shared memory.
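Because the prefixes are standardized, diary entries can be filtered by type with trivial string handling. A minimal sketch, using invented example entries (the prefixes are the Rule 5a ones from the list above):

```python
# Filter diary drawers by their standardized Rule 5a prefix.
# The entries below are invented examples for illustration.
DIARY = [
    "OBSERVATION: Claude forgets to save decisions after commits",
    "LESSON: FTS5 BM25 was the highest-value improvement",
    "PATTERN: Codex reads docs first, Claude generates first",
    "LESSON: translate non-English queries before searching",
]

def entries_of_type(entries: list, prefix: str) -> list:
    """Return entry bodies whose prefix matches, e.g. prefix='lesson'."""
    tag = prefix.upper() + ":"
    return [e[len(tag):].strip() for e in entries if e.startswith(tag)]

lessons = entries_of_type(DIARY, "lesson")
```

The same convention makes the full-text query shown above (`mempal search "lesson" --wing agent-diary`) effective, since the prefix token appears verbatim in every entry of that type.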
What This Pattern Suggests
The Claude↔Codex relay via mempal drawers is a specific instance of a general pattern: asynchronous coordination through shared memory.
Traditional multi-agent systems use direct messaging, shared queues, or orchestration frameworks. These require the agents to be online simultaneously or to share a communication protocol. The mempal pattern requires neither. An agent writes a drawer. Hours or days later, a different agent — possibly a different model, from a different vendor, running in a different tool — searches for that topic and finds the drawer. Coordination happens through semantic search over shared state, not through direct communication.
This pattern has three properties worth noting:
Vendor independence. Claude and Codex are different models from different companies. They coordinate not because they share an architecture or a protocol, but because they both support MCP and can both read and write natural-language drawers. Any agent that connects to mempal's MCP server can participate — the coordination layer is the memory, not the agent.
Asynchronous by default. There is no "session handoff" protocol. The writing agent does not know who will read the drawer. The reading agent does not know who wrote it. They are decoupled in time and identity. The only contract is the MEMORY_PROTOCOL: save decisions with rationale, cite sources, verify before asserting.
Emergent review. Nobody designed a "cross-agent code review" feature. It emerged from the combination of shared memory, decision persistence, and different agents with different behavioral patterns. Codex's tendency to check documentation before writing complemented Claude's tendency to generate from first principles. The memory system made this complementarity visible and actionable.
Whether this pattern generalizes beyond two agents working on a single Rust project remains to be seen. But the evidence from mempal's own development suggests that shared, citation-bearing memory is a sufficient substrate for meaningful multi-agent coordination — no orchestration framework required.
Chapter 30: The Honest Gap
Positioning: This chapter documents what mempal is not yet, with data. Rather than letting readers discover limitations, we state them first. Prerequisite: Chapters 26-29 (what was built). Applicable scenario: when evaluating whether mempal fits your use case.
The Numbers
Before discussing gaps, here are the benchmark results that ground this chapter. All data from benchmarks/longmemeval_s_summary.md, run locally on the LongMemEval s_cleaned dataset (500 questions, 53 sessions).
384d Baseline (ONNX MiniLM-L6-v2)
| Mode | R@1 | R@5 | R@10 | NDCG@10 | Time |
|---|---|---|---|---|---|
| raw + session | 0.806 | 0.966 | 0.982 | 0.889 | 415s |
| aaak + session | 0.830 | 0.952 | 0.974 | 0.892 | 502s |
| rooms + session | 0.734 | 0.878 | 0.896 | 0.808 | 422s |
256d Model2Vec (potion-base-8M)
| Mode | R@1 | R@5 | R@10 | NDCG@10 | Time |
|---|---|---|---|---|---|
| raw + session | 0.816 | 0.952 | 0.976 | 0.888 | 102s |
| aaak + session | 0.806 | 0.948 | 0.972 | 0.883 | 116s |
| rooms + session | 0.744 | 0.868 | 0.890 | 0.808 | 84s |
Comparison with MemPalace
| System | Mode | R@5 | API Calls |
|---|---|---|---|
| mempal (384d) | raw | 96.6% | Zero |
| mempal (256d) | raw | 95.2% | Zero |
| MemPalace | raw | 96.6% | Zero |
| mempal (384d) | AAAK | 95.2% | Zero |
| MemPalace | AAAK | 84.2% | Zero |
| MemPalace | hybrid+rerank | 100% | API calls |
graph LR
subgraph "mempal"
M1["raw 95.2%<br/>(model2vec 256d)"]
M2["raw 96.6%<br/>(ONNX 384d)"]
M3["AAAK 95.2%"]
end
subgraph "MemPalace"
P1["raw 96.6%"]
P2["AAAK 84.2%"]
P3["hybrid+rerank 100%"]
end
M2 -.->|"matches"| P1
M3 -->|"+11pp"| P2
P3 -->|"mempal lacks<br/>reranking"| M1
The honest reading: mempal matches MemPalace on raw retrieval and significantly outperforms on AAAK (95.2% vs 84.2%, an 11 percentage-point improvement from the BNF grammar + jieba rework). But mempal does not reach MemPalace's hybrid+rerank 100% — that path requires API calls for reranking, which mempal deliberately omits in its zero-dependency default configuration.
Gap 1: Model2Vec vs Full Transformer Quality
The switch from ONNX MiniLM (384d) to model2vec (256d) trades retrieval quality for deployment simplicity:
| Metric | 384d | 256d | Delta |
|---|---|---|---|
| R@5 (raw) | 0.966 | 0.952 | -1.4pp |
| NDCG@10 | 0.889 | 0.888 | -0.001 |
| Speed | 415s | 102s | 4x faster |
| Native deps | ONNX Runtime (~200MB) | None | Zero |
The tradeoff is measurable: 1.4 percentage points of R@5 for zero native dependencies and 4x speed. For a personal developer tool, this is likely acceptable. For a system where every fraction of a percent matters (medical records, legal discovery), it would not be.
The onnx feature flag preserves the full-transformer path for users who want maximum quality at the cost of a heavier install.
Gap 2: Non-English Search Degradation
Tested empirically during development: the Chinese query "它不再是一个高级原型" returned irrelevant AAAK documentation instead of the target status snapshot. The same query translated to English — "no longer just an advanced prototype" — hit the correct drawer immediately.
The model2vec multilingual model (potion-multilingual-128M) improves this — Chinese queries no longer completely miss — but English queries still retrieve more reliably. The practical gap:
- English query: target drawer typically in top-1 or top-3
- Chinese query: target drawer may appear in top-3 but with lower similarity, or be pushed out by false positives
MEMORY_PROTOCOL Rule 3a ("TRANSLATE QUERIES TO ENGLISH") is a workaround that leverages the fact that all mempal consumers are LLMs capable of translation. This is not a model-level fix. A proper solution would require either a stronger multilingual embedding model or a Chinese-specific FTS5 tokenizer — both of which add complexity that conflicts with the zero-dependency goal.
Gap 3: No Reranking
MemPalace achieves 100% R@5 with hybrid search + reranking (using API-based cross-encoder). mempal's hybrid search (BM25 + vector + RRF) reaches 95-96% without reranking.
The Reranker trait exists (crates/mempal-search/src/rerank.rs) with a NoopReranker default. An ONNX cross-encoder implementation would close the gap, but it would add ~50-600MB of model weight depending on the reranker chosen, breaking the "light binary" promise.
The architectural decision: reranking is an optional enhancement, not a default. The 4-5 percentage point gap between RRF-only and reranked results is real, but for the typical use case (finding a decision made last week, not searching a million-document corpus), RRF is sufficient.
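For reference, the RRF merge that mempal's hybrid search relies on is small enough to show in full. This is the standard Reciprocal Rank Fusion formula (score contribution 1/(k + rank), commonly with k=60); the document IDs are illustrative.

```python
# Standard Reciprocal Rank Fusion: merge several rankings by summing
# 1/(k + rank) per document. k=60 is the conventional default.
def rrf_merge(rankings: list, k: int = 60) -> list:
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25 = ["d3", "d1", "d7"]    # keyword (FTS5) ranking
vector = ["d1", "d9", "d3"]  # semantic (embedding) ranking
merged = rrf_merge([bm25, vector])
```

Documents ranked highly by both signals (here `d1` and `d3`) rise to the top without any score normalization between BM25 and cosine similarity — which is exactly why RRF is attractive when no reranker is available.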
Gap 4: Knowledge Graph Is Manual
mempal's triples table is activated — agents can add, query, invalidate, and browse timelines. But there is no automatic extraction. When an agent saves "Kai recommended Clerk over Auth0 based on pricing and DX" via mempal_ingest, no triples are automatically created. The agent must explicitly call mempal_kg add "Kai" "recommends" "Clerk".
MEMORY_PROTOCOL does not yet include a rule for automatic triple extraction (the proposed Rule 4a was discussed but not implemented in protocol). This means the knowledge graph grows only when agents remember to populate it — which, based on our experience with Rule 4 (SAVE AFTER DECISIONS), they sometimes forget.
The alternative — LLM-based extraction at ingest time — conflicts with the local-first, zero-API-call philosophy. Every ingest would require an LLM call to extract entities and relationships. For a tool that processes hundreds of drawers during mempal ingest, this would be prohibitively slow and expensive.
Gap 5: Taxonomy Routing Underperforms
The benchmark data reveals an uncomfortable truth: taxonomy-based room routing (rooms mode) consistently underperforms raw search.
| Mode | 384d R@5 | 256d R@5 |
|---|---|---|
| raw | 0.966 | 0.952 |
| rooms | 0.878 | 0.868 |
Room routing loses 8-9 percentage points compared to raw search. This means the taxonomy is currently hurting retrieval precision on LongMemEval, not helping it.
The likely cause: LongMemEval's question distribution does not align well with mempal's default taxonomy. Questions that span multiple rooms get routed to the wrong scope. Chapter 7's 34% improvement from Wing/Room filtering was measured on MemPalace's benchmark, where the taxonomy was presumably tuned for the data. mempal's auto-detected taxonomy from mempal init may not be well-calibrated for arbitrary datasets.
This does not invalidate spatial structure as a concept — it validates the design decision to make taxonomy editable rather than fixed. But it does mean that out-of-the-box taxonomy routing needs improvement, either through better auto-detection heuristics or through a "tune taxonomy from search feedback" mechanism that does not yet exist.
Gap 6: Tunnel Discovery Is Forward-Looking
Tunnels work — we demonstrated this with a live example (the mempal-mcp room appearing in both mempal and hermes-agent wings). But with most users having only one wing, tunnels provide zero value in practice.
The feature is architecturally sound (dynamic SQL discovery, inline search hints, zero storage cost) but awaits the multi-project use case to prove its worth. This is a feature that was built from analysis (Chapter 6) rather than from user demand — the honest assessment is that it may never be used by most users.
What These Gaps Mean
The gaps fall into three categories:
Acceptable tradeoffs (Gaps 1, 3): model2vec quality and lack of reranking are deliberate choices — speed and simplicity over marginal precision. The onnx feature and Reranker trait preserve upgrade paths.
Known limitations with workarounds (Gaps 2, 4): non-English search and manual KG are real limitations, but protocol-level workarounds (Rule 3a, potential Rule 4a) mitigate them for the target audience (AI agents that can translate and extract).
Unproven features (Gaps 5, 6): taxonomy routing regression and tunnel underuse suggest that some features were built from analysis rather than validated by usage. They need real-world feedback to prove or disprove their value.
The Remaining Blockers
For crates.io public release: the main blocker is no longer code quality (CI is green, tests pass, clippy is clean) but release process — token management, publish order, version policy. The benchmark data in this chapter provides the credibility foundation that was previously missing.
For broad adoption: the gap between "works for the author" and "works for anyone" is still uncrossed. Installation is a single cargo install, but the first-run experience (model download, mempal init, understanding wing/room concepts) has not been tested with users who did not build the tool.
For the book itself: this chapter closes the narrative loop that Part 10 opened. Twenty-five chapters analyzed MemPalace. Five chapters documented the rewrite. This chapter provides the honest self-assessment that makes the analysis credible — because a book that only praises its subject is marketing, not engineering.
Appendix A: E2E Trace — From mempalace init to First Search
This appendix traces a complete user journey: from pip install mempalace to the first search returning results. Every step is annotated with source code locations, and every data flow can be verified in the codebase. Related chapters: Chapter 5 (Wing-Hall-Room structure), Chapters 16-18 (normalization, entity detection, chunking pipeline), Chapters 14-15 (memory layers and hybrid search).
Scenario
User Alex has a project directory ~/projects/my_app containing frontend code (components/), backend code (api/), documentation (docs/), and several Claude Code conversation export files. He wants to:
- Have MemPalace automatically identify the project structure and generate Room classifications
- Ingest both project files and conversation records into the memory palace
- Search for "why did we switch to GraphQL" and find the decision conversation from that time
The entire flow involves three commands: mempalace init, mempalace mine, mempalace search.
Sequence Diagram
sequenceDiagram
participant U as User (CLI)
participant CLI as cli.py
participant ED as entity_detector.py
participant RD as room_detector_local.py
participant CFG as config.py
participant M as miner.py
participant CM as convo_miner.py
participant S as searcher.py
participant DB as ChromaDB
Note over U,DB: Phase 1: Initialization
U->>CLI: mempalace init ~/projects/my_app
CLI->>ED: scan_for_detection(dir)
ED-->>CLI: returns prose file list
CLI->>ED: detect_entities(files)
ED->>ED: extract_candidates() → extract capitalized proper nouns
ED->>ED: score_entity() → person/project scoring
ED->>ED: classify_entity() → classification + confidence
ED-->>CLI: {people, projects, uncertain}
CLI->>ED: confirm_entities(detected)
ED-->>CLI: confirmed entity list
CLI->>CLI: write entities.json
CLI->>RD: detect_rooms_local(dir)
RD->>RD: detect_rooms_from_folders()
RD->>RD: save_config() → write mempalace.yaml
RD-->>CLI: Room configuration complete
CLI->>CFG: MempalaceConfig().init()
CFG-->>CLI: write ~/.mempalace/config.json
Note over U,DB: Phase 2: Data Ingestion
U->>CLI: mempalace mine ~/projects/my_app
CLI->>M: mine(project_dir, palace_path)
M->>M: load_config() → read mempalace.yaml
M->>M: scan_project() → collect readable files
M->>DB: get_collection("mempalace_drawers")
loop Each file
M->>M: detect_room() → route to Room
M->>M: chunk_text() → 800-char chunks
M->>DB: add_drawer() → write to ChromaDB
end
U->>CLI: mempalace mine ~/chats/ --mode convos
CLI->>CM: mine_convos(convo_dir, palace_path)
CM->>CM: scan_convos() → collect conversation files
loop Each conversation file
CM->>CM: normalize() → format normalization
CM->>CM: chunk_exchanges() → chunk by exchange pairs
CM->>CM: detect_convo_room() → topic classification
CM->>DB: collection.add() → write to ChromaDB
end
Note over U,DB: Phase 3: Search
U->>CLI: mempalace search "why did we switch to GraphQL"
CLI->>S: search(query, palace_path)
S->>DB: PersistentClient(path)
S->>DB: col.query(query_texts, n_results)
DB-->>S: {documents, metadatas, distances}
S->>S: compute similarity = 1 - distance
S-->>U: formatted output
Phase 1: Initialization (mempalace init)
Initialization is the most critical step in the entire system — it determines how data is organized, and that organization is baked in at write time. Initialization does two things: detect entities (who and what) and detect Rooms (how to classify).
1.1 Entry Point: CLI Parsing
The user runs mempalace init ~/projects/my_app, and argparse passes the dir argument to the cmd_init function (cli.py:37). This function is the orchestrator for the entire initialization flow, calling the entity detection and Room detection subsystems in sequence.
1.2 Entity Detection: Pass 1 — File Scanning
cmd_init first calls scan_for_detection(args.dir) (cli.py:45). This function is defined at entity_detector.py:813, and its job is to collect files suitable for entity detection.
Key design decision: prioritize prose files. PROSE_EXTENSIONS (entity_detector.py:400-405) includes only .txt, .md, .rst, .csv, because capitalized identifiers in code files (class names, function names) would produce a large number of false positives. Only when fewer than 3 prose files are found does it fall back to including code files (entity_detector.py:834). Each scan processes at most 10 files (max_files parameter), reading only the first 5KB of each file (entity_detector.py:652) — this is not laziness, but because if an entity is important enough to remember, it will appear repeatedly near the beginning of files.
1.3 Entity Detection: Pass 2 — Extraction and Scoring
detect_entities(files) (entity_detector.py:632) executes a three-step pipeline:
Candidate extraction: extract_candidates() (entity_detector.py:443) uses the regex r"\b([A-Z][a-z]{1,19})\b" to find all capitalized words, filters out a stopword list (STOPWORDS, approximately 200 common English words, entity_detector.py:92-396), and retains only words appearing 3 or more times. It also uses r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b" to extract multi-word proper nouns (e.g., "Memory Palace", "Claude Code").
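The extraction step above can be sketched in a few lines. This is a simplified illustration rather than the source code: the stopword set here is a tiny hypothetical subset of the ~200-word STOPWORDS list, and the real pipeline also runs the multi-word proper-noun regex.

```python
import re
from collections import Counter

# Hypothetical stopword subset for illustration; the real STOPWORDS list
# in entity_detector.py holds roughly 200 common English words.
STOPWORDS = {"The", "This", "When", "After", "Because"}

def extract_candidates(text, min_count=3):
    """Collect capitalized words that appear min_count+ times, minus stopwords."""
    words = re.findall(r"\b([A-Z][a-z]{1,19})\b", text)
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w: n for w, n in counts.items() if n >= min_count}

sample = "Alice said hi. Alice asked Bob. Alice left. The Bob thing."
print(extract_candidates(sample))  # {'Alice': 3}
```

Note how the frequency floor alone already discards "Bob" (two occurrences) while keeping "Alice".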
Signal scoring: For each candidate, score_entity() (entity_detector.py:486) scores using two sets of regex patterns:
- Person signals (PERSON_VERB_PATTERNS, entity_detector.py:27-48): action patterns like {name} said, {name} asked, hey {name}. Dialogue markers (DIALOGUE_PATTERNS) carry the highest weight, with each match scoring +3. Pronoun proximity detection checks whether she/he/they and other pronouns appear within 3 lines before or after the name.
- Project signals (PROJECT_VERB_PATTERNS, entity_detector.py:72-89): technical patterns like building {name}, import {name}, {name}.py. Version number markers and code references carry the highest weight, with each match scoring +3.
Classification: classify_entity() (entity_detector.py:562) makes a judgment based on the person/project score ratio. An important safeguard: even if the person score exceeds 70%, if only one signal type is present (e.g., only pronoun matches), the entity is downgraded to "uncertain" (entity_detector.py:605-609). This prevents words like "Click" from being misclassified as a person name due to frequent occurrence in "Click said" patterns.
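A minimal sketch of the ratio-plus-safeguard logic. The 70% person threshold follows the text; the 30% project threshold and the signal_types parameter are illustrative assumptions, not the exact source behavior.

```python
def classify_entity(person_score, project_score, signal_types):
    # Ratio-based judgment: the 70% person threshold follows the text;
    # the 30% project threshold is an illustrative assumption.
    total = person_score + project_score
    if total == 0:
        return "uncertain"
    ratio = person_score / total
    if ratio > 0.7:
        # Dual-signal safeguard: a high person ratio backed by only one
        # signal type (e.g., pronouns alone) is downgraded to "uncertain".
        return "person" if len(signal_types) >= 2 else "uncertain"
    if ratio < 0.3:
        return "project"
    return "uncertain"

print(classify_entity(9, 1, {"pronoun"}))              # uncertain
print(classify_entity(9, 1, {"pronoun", "dialogue"}))  # person
```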
1.4 Entity Confirmation and Saving
confirm_entities() (entity_detector.py:717) lets the user interactively review detection results. If the --yes flag is passed, all detected people and projects are automatically accepted, and uncertain items are skipped (entity_detector.py:739-744). Confirmed entities are saved as entities.json (cli.py:54-56) for use by the subsequent miner.
1.5 Room Detection
After entity detection completes, cmd_init calls detect_rooms_local(project_dir) (cli.py:62). This function is defined at room_detector_local.py:270 and performs the following steps:
Folder scanning: detect_rooms_from_folders() (room_detector_local.py:97) traverses the project's top-level and second-level directories, matching folder names against FOLDER_ROOM_MAP (room_detector_local.py:20-94). This mapping table covers 40+ common folder naming conventions: frontend/client/ui/components all map to the "frontend" Room, backend/server/api/routes/models all map to the "backend" Room.
For our scenario, components/ would match to "frontend", api/ to "backend", and docs/ to "documentation". If a top-level folder is not in the mapping table but looks like a valid name (length > 2, starts with a letter), it is used directly as the Room name (room_detector_local.py:128-130).
Fallback strategy: If the folder structure produces only a single "general" Room, the system calls detect_rooms_from_files() (room_detector_local.py:168) to detect Rooms through keywords in filenames.
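The folder-to-Room routing amounts to a dictionary lookup plus the valid-name fallback described above. The mapping here is a small illustrative subset of the real 40+ entry FOLDER_ROOM_MAP, not the full table.

```python
# Illustrative subset; the real FOLDER_ROOM_MAP in room_detector_local.py
# covers 40+ folder naming conventions.
FOLDER_ROOM_MAP = {
    "frontend": "frontend", "client": "frontend", "ui": "frontend",
    "components": "frontend",
    "backend": "backend", "server": "backend", "api": "backend",
    "routes": "backend", "models": "backend",
    "docs": "documentation",
}

def room_for_folder(name):
    """Map a folder name to a Room, falling back per the rules in the text."""
    key = name.lower()
    if key in FOLDER_ROOM_MAP:
        return FOLDER_ROOM_MAP[key]
    if len(key) > 2 and key[0].isalpha():
        return key  # valid-looking names become Rooms directly
    return "general"

print(room_for_folder("components"))  # frontend
print(room_for_folder("scripts"))     # scripts
```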
Configuration saving: save_config() (room_detector_local.py:255) writes the Wing name (derived from the directory name) and Room list to mempalace.yaml in the project directory. For ~/projects/my_app, the generated configuration looks roughly like:
wing: my_app
rooms:
- name: frontend
description: Files from components/
- name: backend
description: Files from api/
- name: documentation
description: Files from docs/
- name: general
description: Files that don't fit other rooms
1.6 Global Configuration
Finally, MempalaceConfig().init() (cli.py:63) creates the global configuration file config.json under ~/.mempalace/ (config.py:126-138), containing the palace path (default ~/.mempalace/palace), collection name (mempalace_drawers), topic Wing list, and keyword mappings.
Initialization Output Summary
| File | Location | Purpose |
|---|---|---|
| entities.json | ~/projects/my_app/ | Confirmed list of people and projects |
| mempalace.yaml | ~/projects/my_app/ | Wing name + Room definitions |
| config.json | ~/.mempalace/ | Global configuration (palace path, etc.) |
Phase 2: Data Ingestion (mempalace mine)
2.1 Project File Ingestion
Running mempalace mine ~/projects/my_app, cmd_mine (cli.py:66) calls mine() (miner.py:315) in the default projects mode.
Configuration loading: load_config() (miner.py:66) reads mempalace.yaml from the project directory, obtaining the Wing name and Room list.
File scanning: scan_project() (miner.py:287) recursively traverses the project directory, collecting the 20 file types defined in READABLE_EXTENSIONS (miner.py:19-40) (.py, .js, .ts, .md, etc.), skipping directories listed in SKIP_DIRS (miner.py:42-54) (.git, node_modules, __pycache__, etc.).
Obtaining the ChromaDB collection: get_collection() (miner.py:183) connects to ChromaDB in PersistentClient mode and creates or retrieves the collection named mempalace_drawers. ChromaDB uses the default all-MiniLM-L6-v2 model to automatically generate embedding vectors — no API key required.
Per-file processing: process_file() (miner.py:233) executes a three-step pipeline for each file:
- Deduplication check: file_already_mined() (miner.py:192) queries ChromaDB, and if the source file has already been ingested, skips it immediately.
- Room routing: detect_room() (miner.py:89) matches by priority: folder path -> filename -> content keywords -> fallback to "general". For components/Header.tsx, the first priority (path contains "components", matching the "frontend" Room) hits immediately.
- Chunking: chunk_text() (miner.py:135) splits file content into 800-character chunks (CHUNK_SIZE, miner.py:56) with 100-character overlap between adjacent chunks (CHUNK_OVERLAP, miner.py:57). Split points prefer paragraph boundaries (\n\n), followed by line boundaries (\n), ensuring no breaks in the middle of sentences. The minimum chunk size is 50 characters (MIN_CHUNK_SIZE, miner.py:58); shorter fragments are discarded.
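The chunking rules can be exercised with a runnable sketch that uses the CHUNK_SIZE/CHUNK_OVERLAP/MIN_CHUNK_SIZE values from the text. It is a simplification of what miner.py's chunk_text() does, not a copy of it.

```python
def chunk_text(text, size=800, overlap=100, min_chunk=50):
    """Fixed-window chunking with boundary-aware splits (sketch)."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Prefer a paragraph break, then a line break, near the window end.
            for sep in ("\n\n", "\n"):
                cut = text.rfind(sep, start + min_chunk, end)
                if cut != -1:
                    end = cut
                    break
        piece = text[start:end].strip()
        if len(piece) >= min_chunk:
            chunks.append(piece)  # fragments under min_chunk are discarded
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # re-include an overlap window
    return chunks
```

Running this over a multi-paragraph document yields chunks of at most 800 characters, each starting near a paragraph or line boundary, with 100 characters of shared context between neighbors.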
Writing to ChromaDB: add_drawer() (miner.py:201) generates a unique ID for each chunk (drawer_{wing}_{room}_{md5_hash}) and writes it along with six metadata fields:
{
"wing": wing, # project name
"room": room, # Room name
"source_file": source, # original file path
"chunk_index": index, # chunk sequence number within the file
"added_by": agent, # ingestion agent
"filed_at": timestamp, # ISO 8601 timestamp
}
These metadata fields are determined at write time — the filtering capabilities available during search depend entirely on the completeness of annotations at this stage.
2.2 Conversation File Ingestion
Running mempalace mine ~/chats/ --mode convos, cmd_mine (cli.py:69) calls mine_convos() (convo_miner.py:252). Conversation ingestion shares the same ChromaDB collection as project ingestion, but uses a different chunking strategy.
Format normalization: Each file first passes through normalize() (convo_miner.py:302), which unifies different formats from Claude Code, ChatGPT, Slack, and others into a canonical format of > user question\n AI answer.
Exchange-pair chunking: chunk_exchanges() (convo_miner.py:52) detects the number of > markers in the file. If more than 3 > markers are found, it identifies the file as conversation format and calls _chunk_by_exchange() (convo_miner.py:66) — a user question (> line) plus the immediately following AI answer forms an indivisible chunk. If the file is not in conversation format, it falls back to _chunk_by_paragraph() (convo_miner.py:102) for paragraph-based chunking.
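The exchange-pair rule can be sketched as follows. This is a simplified stand-in for _chunk_by_exchange(), keeping only the behavior described above: a > line starts a new chunk, and the answer lines after it stay attached.

```python
def chunk_by_exchange(text):
    """Each '>' question line plus the answer lines after it form one chunk."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

convo = "> why GraphQL?\nBecause of typed schemas.\n> and REST?\nKept for webhooks."
for chunk in chunk_by_exchange(convo):
    print(repr(chunk))
```

The payoff is that a question and its answer can never be split across chunks, so a search hit always returns the full exchange.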
Topic Room detection: detect_convo_room() (convo_miner.py:194) scores the first 3000 characters of the content against TOPIC_KEYWORDS (convo_miner.py:127-191). Five topic Rooms — technical, architecture, planning, decisions, problems — each have 10-13 keywords. The highest-scoring topic becomes the Room for that conversation. For a GraphQL discussion containing "switched", "chose", and "alternative", the "decisions" Room would receive the highest score.
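A sketch of the keyword-scoring idea behind detect_convo_room(). The keyword lists here are small illustrative subsets of TOPIC_KEYWORDS, and only three of the five topic Rooms are shown.

```python
# Illustrative subsets; TOPIC_KEYWORDS in convo_miner.py has 10-13
# keywords per topic Room, across five Rooms.
TOPIC_KEYWORDS = {
    "technical": ["error", "bug", "function", "install"],
    "decisions": ["chose", "switched", "alternative", "instead", "decided"],
    "planning": ["roadmap", "milestone", "deadline", "sprint"],
}

def detect_convo_room(text, default="general"):
    """Score the first 3000 characters against each topic's keywords."""
    head = text[:3000].lower()
    scores = {room: sum(head.count(kw) for kw in kws)
              for room, kws in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(detect_convo_room("We switched because we chose the alternative."))  # decisions
```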
Additional metadata at write time: Conversation chunks include two extra metadata fields beyond project chunks — "ingest_mode": "convos" and "extract_mode": "exchange" (convo_miner.py:368-369). This allows distinguishing between project knowledge and conversation memories during search.
Phase 3: Search (mempalace search)
3.1 Search Entry Point
Running mempalace search "why did we switch to GraphQL", cmd_search (cli.py:94) calls search() (searcher.py:15).
3.2 Connecting to the Palace
search() first connects to ChromaDB using PersistentClient and obtains the mempalace_drawers collection (searcher.py:21-22). If the palace does not exist, it immediately returns an error and suggests the user run init and mine first (searcher.py:24-26).
3.3 Building Filters
If the user specifies --wing or --room, search() builds a ChromaDB where filter (searcher.py:29-35). When both are specified, it uses an $and compound query:
where = {"$and": [{"wing": wing}, {"room": room}]}
This is the core of MemPalace's hybrid search: first use metadata for exact filtering (narrowing the search scope), then perform vector similarity search on the filtered subset. For the query "why did we switch to GraphQL", if the user adds --wing my_app --room decisions, the search space might shrink from thousands of Drawers to a few dozen.
3.4 Executing the Query
col.query() (searcher.py:46) vectorizes the query text (ChromaDB internally uses the all-MiniLM-L6-v2 model), then finds the n_results nearest neighbor vectors in the collection. It returns three parallel arrays: documents (original text), metadatas (metadata), and distances (vector distances).
3.5 Result Formatting
searcher.py:68-83 formats the output. Similarity is calculated as 1 - distance (searcher.py:69), since ChromaDB uses cosine distance by default. The output includes:
[1] my_app / decisions
Source: graphql-migration.md
Match: 0.847
> Why did we switch from REST to GraphQL?
We discussed this on Tuesday. The main reasons were...
Each result is annotated with Wing (which project it came from), Room (which classification it belongs to), source (original filename), and match (similarity score), along with the full original text — not a summary, not a paraphrase, but the exact words the user wrote at the time.
3.6 Programmatic Search Interface
Beyond CLI output, search_memories() (searcher.py:87) provides a dictionary-returning programmatic interface for use by the MCP server and other callers. Return format:
{
"query": "why did we switch to GraphQL",
"filters": {"wing": "my_app", "room": "decisions"},
"results": [
{
"text": "original text...",
"wing": "my_app",
"room": "decisions",
"source_file": "graphql-migration.md",
"similarity": 0.847,
}
],
}
What This Trace Reveals
Looking back at the complete data flow, three design principles emerge:
Structure before content. The init phase determines the palace's skeleton — Wing name, Room list, entity mappings — before any data is ingested. This is not engineering laziness but a deliberate choice: if you do not know who and what exists in your world, no amount of vector embeddings will help you. The two signal pattern categories in entity detection (person verbs vs. project verbs) and the dual-signal safeguard mechanism (entity_detector.py:601) are all designed to get classification right before data enters the palace.
Metadata is determined at write time. Each Drawer's Wing, Room, and source_file are fixed at the moment add_drawer() is called. The filtering capabilities available during search derive entirely from the annotation quality at write time. This means that if the routing logic in detect_room() is wrong, the error is permanently preserved — but it also means that search requires no additional classification computation and is extremely fast. The six metadata fields in miner.py:201-225 represent the upper bound of MemPalace's search capabilities.
Same palace, different ingestion strategies. Project files and conversation files are both written to the same mempalace_drawers collection (miner.py:188, convo_miner.py:217), but their chunking logic is entirely different: project files are split by a fixed 800-character window (miner.py:56), while conversation files are split at semantic boundaries defined by exchange pairs (convo_miner.py:66). Both strategies serve the same type of query — when you search for "why did we switch to GraphQL", results from code comments and conversation memories appear side by side, ranked by a unified similarity score. This is the practical effect of the Wing-Hall-Room structure from Chapter 5: structure provides classification, vectors provide association, and the two do not interfere with each other.
Appendix B: E2E Trace — MCP Tool Call Lifecycle
This appendix traces a real AI-to-MemPalace interaction, from user question to final answer, showing the data flow under the MCP protocol frame by frame. Related chapters: Chapter 19 (MCP server), Chapter 11 (knowledge graph), Chapter 9 (AAAK).
Scenario
The user opens Claude Code and types what seems like a simple question:
"Why did we decide to use Clerk last month?"
This question implicitly contains three layers of requirements: time filtering ("last month"), decision tracing ("why"), and entity identification ("Clerk"). Claude cannot guess from thin air — it needs to check the palace. Over the next few hundred milliseconds, the MCP protocol will drive four phases of tool calls, each with precise input/output boundaries.
Sequence Diagram: Complete Lifecycle
sequenceDiagram
participant U as User
participant C as Claude
participant MCP as MCP Server
participant S as Searcher
participant KG as KnowledgeGraph
participant DB as ChromaDB
participant SQL as SQLite
U->>C: "Why did we decide to use Clerk last month?"
Note over C: Phase 1: Session Startup
C->>MCP: tools/call: mempalace_status
MCP->>DB: _get_collection()
DB-->>MCP: collection
MCP->>DB: col.get(include=["metadatas"])
DB-->>MCP: all_meta
MCP-->>C: {total_drawers, wings, rooms, protocol, aaak_dialect}
Note over C: Phase 2: Semantic Search
C->>MCP: tools/call: mempalace_search<br/>{query: "reasons for choosing Clerk", wing: "wing_myproject"}
MCP->>S: search_memories()
S->>DB: col.query(query_texts, where, n_results)
DB-->>S: {documents, metadatas, distances}
S-->>MCP: {query, filters, results: [...]}
MCP-->>C: search results (with original text + similarity)
Note over C: Phase 3: Knowledge Graph Query
C->>MCP: tools/call: mempalace_kg_query<br/>{entity: "Clerk", direction: "both"}
MCP->>KG: query_entity("Clerk")
KG->>SQL: SELECT ... WHERE subject=? OR object=?
SQL-->>KG: rows
KG-->>MCP: facts[]
MCP-->>C: {entity, facts, count}
Note over C: Phase 4: Writing New Discoveries
C->>MCP: tools/call: mempalace_add_drawer<br/>{wing, room, content}
MCP->>MCP: tool_check_duplicate(content, 0.9)
MCP->>DB: col.query(query_texts, n_results=5)
DB-->>MCP: no duplicates
MCP->>DB: col.add(ids, documents, metadatas)
DB-->>MCP: write successful
MCP-->>C: {success: true, drawer_id}
C-->>U: synthesized answer + cited sources
Phase 1: Session Startup (mempalace_status)
Every time Claude interacts with MemPalace, the first call is always mempalace_status. This is not optional — the first rule of PALACE_PROTOCOL states:
"ON WAKE-UP: Call mempalace_status to load palace overview + AAAK spec."
(mcp_server.py:94)
Request
The MCP client sends a JSON-RPC request:
{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {
"name": "mempalace_status",
"arguments": {}
}
}
This request is routed by handle_request() to the tools/call branch (mcp_server.py:719), then dispatched through the TOOLS dictionary to the tool_status() function (mcp_server.py:63).
Execution Path
tool_status() first calls _get_collection() (mcp_server.py:41-49) to obtain the ChromaDB collection. If the collection does not exist — for example, if the user has never run mempalace init — the function returns _no_palace() (mcp_server.py:52-57), providing clear remediation guidance:
def _no_palace():
return {
"error": "No palace found",
"palace_path": _config.palace_path,
"hint": "Run: mempalace init <dir> && mempalace mine <dir>",
}
When the palace exists, tool_status() iterates over all metadata, tallying Wing and Room distributions (mcp_server.py:70-78), then returns a dictionary with six fields.
Triple Payload
The return value of tool_status() is more than just statistics. It carries a triple payload (mcp_server.py:79-86):
return {
"total_drawers": count, # Payload 1: Palace overview
"wings": wings,
"rooms": rooms,
"palace_path": _config.palace_path,
"protocol": PALACE_PROTOCOL, # Payload 2: Behavioral protocol
"aaak_dialect": AAAK_SPEC, # Payload 3: AAAK specification
}
Payload 1: Palace overview. total_drawers, wings, rooms tell the AI how large this palace is and what domains it covers. The AI uses this to decide which Wing to constrain subsequent searches to.
Payload 2: Behavioral protocol. PALACE_PROTOCOL (mcp_server.py:93-100) consists of five behavioral rules, the most critical being the second and third:
"BEFORE RESPONDING about any person, project, or past event: call mempalace_kg_query or mempalace_search FIRST. Never guess — verify."
"IF UNSURE about a fact: say 'let me check' and query the palace. Wrong is worse than slow."
These two rules switch the AI from "generation mode" to "verification mode". Without this protocol, the AI would simply fabricate answers.
Payload 3: AAAK specification. AAAK_SPEC (mcp_server.py:102-119) teaches the AI how to read and write the AAAK compression format — entity codes (ALC=Alice), sentiment markers (*warm*=joy), structural delimiters (pipe-separated fields). This means the AI can directly understand AAAK-formatted text returned by subsequent searches.
The design intent behind this triple payload is: in a single call, the AI acquires all the context needed to operate the palace. No configuration files to read, no additional initialization steps required.
Phase 2: Search (mempalace_search)
After receiving the status, Claude parses the user's question and constructs a search request. It knows "Clerk" is most likely related to project decisions, so it constrains the search to wing_myproject.
Request
{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "mempalace_search",
"arguments": {
"query": "reasons for choosing Clerk",
"wing": "wing_myproject",
"limit": 5
}
}
}
Execution Path
tool_search() (mcp_server.py:173-180) is a thin proxy — it passes parameters directly through to search_memories():
def tool_search(query: str, limit: int = 5, wing: str = None, room: str = None):
return search_memories(
query,
palace_path=_config.palace_path,
wing=wing,
room=room,
n_results=limit,
)
The actual search logic resides in the search_memories() function at searcher.py:87-142. This function does three things:
Step 1: Build the where filter (searcher.py:100-107). When both Wing and Room are specified, it uses ChromaDB's $and compound condition; when only one is specified, it passes a single condition directly. This branching logic may appear simple, but it determines whether the search spans the entire palace or is constrained to a specific area:
where = {}
if wing and room:
where = {"$and": [{"wing": wing}, {"room": room}]}
elif wing:
where = {"wing": wing}
elif room:
where = {"room": room}
Step 2: Execute the semantic query (searcher.py:109-118). When calling col.query(), it passes query_texts (the vectorized query), n_results (number of results to return), include (fields to return), and the optional where filter. ChromaDB internally embeds the query text as a vector, computes cosine distances against all Drawer vectors, and returns the N nearest results.
Step 3: Assemble the return value (searcher.py:126-142). Each search result contains five fields: text (original text), wing, room, source_file (source filename), and similarity (similarity score, computed as 1 - distance).
hits.append({
"text": doc,
"wing": meta.get("wing", "unknown"),
"room": meta.get("room", "unknown"),
"source_file": Path(meta.get("source_file", "?")).name,
"similarity": round(1 - dist, 3),
})
Example Return Value
{
"query": "reasons for choosing Clerk",
"filters": {"wing": "wing_myproject", "room": null},
"results": [
{
"text": "AUTH.DECISION:2026-03|chose.Clerk→Auth0.rejected|*pragmatic*...",
"wing": "wing_myproject",
"room": "decisions",
"source_file": "meeting-2026-03-12.md",
"similarity": 0.847
}
]
}
Note that the returned text is in AAAK format. Because Phase 1 already loaded the AAAK specification into Claude's context, Claude can directly expand AUTH.DECISION:2026-03|chose.Clerk→Auth0.rejected into natural language.
Phase 3: Knowledge Graph Query (mempalace_kg_query)
The search returned the original record of "choosing Clerk", but Claude wants to know more: what is Clerk's relationship to other project components? Did it replace a previous solution? These kinds of relationship queries are exactly where the knowledge graph excels.
Request
{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "mempalace_kg_query",
"arguments": {
"entity": "Clerk",
"direction": "both"
}
}
}
Execution Path
tool_kg_query() (mcp_server.py:309-312) forwards the request to KnowledgeGraph.query_entity():
def tool_kg_query(entity: str, as_of: str = None, direction: str = "both"):
results = _kg.query_entity(entity, as_of=as_of, direction=direction)
return {"entity": entity, "as_of": as_of, "facts": results, "count": len(results)}
query_entity() (knowledge_graph.py:186-241) is the core query method of the knowledge graph. It accepts three parameters: name (entity name), as_of (time-point filter), and direction (query direction).
Entity ID normalization. First, _entity_id() (knowledge_graph.py:92-93) converts the entity name to lowercase-underscore format: "Clerk" becomes "clerk". This ensures case-insensitive matching.
Bidirectional query. When direction="both", the function executes two SQL queries. The first queries outgoing relationships (knowledge_graph.py:198-217) — "what does Clerk point to":
SELECT t.*, e.name as obj_name
FROM triples t JOIN entities e ON t.object = e.id
WHERE t.subject = ?
The second queries incoming relationships (knowledge_graph.py:219-238) — "what points to Clerk":
SELECT t.*, e.name as sub_name
FROM triples t JOIN entities e ON t.subject = e.id
WHERE t.object = ?
Time filtering. If an as_of parameter is provided, both queries append a time window condition (knowledge_graph.py:201-203):
AND (t.valid_from IS NULL OR t.valid_from <= ?)
AND (t.valid_to IS NULL OR t.valid_to >= ?)
This means only facts that are still valid at the as_of point in time are returned. Facts that have been marked expired by invalidate() (valid_to is not NULL) are automatically excluded.
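The time-window condition can be exercised end to end with an in-memory SQLite database. This is a self-contained sketch with an illustratively simplified triples schema; the real table also stores entity IDs and columns such as confidence and source_closet.

```python
import sqlite3

# Simplified triples schema for illustration only.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE triples (
    subject TEXT, predicate TEXT, object TEXT,
    valid_from TEXT, valid_to TEXT)""")
con.executemany("INSERT INTO triples VALUES (?, ?, ?, ?, ?)", [
    ("myproject", "uses", "auth0", "2025-01-01", "2026-03-12"),  # invalidated
    ("myproject", "uses", "clerk", "2026-03-12", None),          # still current
])

def facts_as_of(con, subject, as_of):
    """Return (predicate, object) facts valid at the as_of date.
    ISO 8601 date strings compare correctly as plain strings."""
    return con.execute("""
        SELECT predicate, object FROM triples
        WHERE subject = ?
          AND (valid_from IS NULL OR valid_from <= ?)
          AND (valid_to   IS NULL OR valid_to   >= ?)""",
        (subject, as_of, as_of)).fetchall()

print(facts_as_of(con, "myproject", "2026-06-01"))  # [('uses', 'clerk')]
print(facts_as_of(con, "myproject", "2025-06-01"))  # [('uses', 'auth0')]
```

Asking "what did myproject use in mid-2025" returns the Auth0 fact, while the same query for mid-2026 returns only Clerk: the expired fact is excluded without ever being deleted.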
Return Value Structure
Each fact includes full timestamps and validity markers:
{
"entity": "Clerk",
"as_of": null,
"facts": [
{
"direction": "outgoing",
"subject": "Clerk",
"predicate": "replaces",
"object": "Auth0",
"valid_from": "2026-03-12",
"valid_to": null,
"confidence": 1.0,
"source_closet": "drawer_wing_myproject_decisions_a3f2...",
"current": true
},
{
"direction": "incoming",
"subject": "MyProject",
"predicate": "uses",
"object": "Clerk",
"valid_from": "2026-03-12",
"valid_to": null,
"confidence": 1.0,
"source_closet": null,
"current": true
}
],
"count": 2
}
current: true indicates this fact is still valid (valid_to is NULL). If source_closet is present, it points to the original source of the knowledge graph fact — a Drawer ID in ChromaDB, forming a back-link from the knowledge graph to the vector store.
Phase 4: Memory Writing (mempalace_add_drawer)
Claude synthesized the search results and knowledge graph query to answer the user. The conversation may have produced new information — for example, the user adds "Right, Auth0's pricing went up 40% at that time, which was also a factor." Claude decides to store this new discovery in the palace.
Request
{
"jsonrpc": "2.0",
"id": 4,
"method": "tools/call",
"params": {
"name": "mempalace_add_drawer",
"arguments": {
"wing": "wing_myproject",
"room": "decisions",
"content": "AUTH.COST:2026-03|Auth0.price↑40%→triggered.Clerk.eval|★★★",
"added_by": "mcp"
}
}
}
Execution Path
The first thing tool_add_drawer() (mcp_server.py:250-287) does is not write — it checks for duplicates.
Idempotency protection. The function calls tool_check_duplicate(content, threshold=0.9) at line 259. tool_check_duplicate() (mcp_server.py:183-215) performs a semantic search on the incoming content, returning any existing Drawers with similarity at or above 0.9. If a duplicate is found, the write is aborted immediately:
dup = tool_check_duplicate(content, threshold=0.9)
if dup.get("is_duplicate"):
return {
"success": False,
"reason": "duplicate",
"matches": dup["matches"],
}
This design addresses a practical problem: an AI may repeatedly attempt to write the same content across multiple conversation turns, or two agents in the same session may independently discover the same fact. The 0.9 threshold allows minor variations in wording but blocks semantic duplicates.
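The idempotency gate can be sketched with plain cosine similarity. In MemPalace the comparison happens inside ChromaDB via tool_check_duplicate(); here the vector store is mocked as a plain dictionary, which is purely illustrative.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def check_duplicate(new_vec, existing, threshold=0.9):
    """Refuse a write when any stored vector matches at or above threshold.
    `existing` maps drawer IDs to embedding vectors (a mock of the collection)."""
    matches = {drawer_id: round(cosine_sim(new_vec, vec), 3)
               for drawer_id, vec in existing.items()
               if cosine_sim(new_vec, vec) >= threshold}
    return {"is_duplicate": bool(matches), "matches": matches}

store = {"drawer_a": [1.0, 0.0], "drawer_b": [0.0, 1.0]}
print(check_duplicate([0.99, 0.05], store))  # flags drawer_a as a duplicate
```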
ID generation. A unique ID is generated via an MD5 hash of the first 100 characters of content plus the current timestamp (mcp_server.py:267):
drawer_id = f"drawer_{wing}_{room}_{hashlib.md5(
(content[:100] + datetime.now().isoformat()).encode()
).hexdigest()[:16]}"
The ID format drawer_{wing}_{room}_{hash} is self-describing — the Wing and Room that a Drawer belongs to can be determined from its ID alone.
Writing to ChromaDB. The final step calls col.add() (mcp_server.py:270-284), writing the document (original text) and metadata (wing, room, source_file, chunk_index, added_by, filed_at). ChromaDB automatically computes the document's embedding vector at write time, making it discoverable by subsequent searches.
Metadata Design
The metadata written (mcp_server.py:276-283) is worth noting:
{
"wing": wing,
"room": room,
"source_file": source_file or "",
"chunk_index": 0,
"added_by": added_by, # "mcp" indicates AI-written
"filed_at": datetime.now().isoformat(),
}
The added_by field distinguishes the source of memories: "mcp" indicates the AI wrote it via MCP, while "mine" indicates it was extracted from files by the mempalace mine command. filed_at is the write time, not the event time — the event time is encoded in the AAAK content itself (e.g., AUTH.COST:2026-03).
What This Trace Reveals
The Triple Payload of Status Is Deliberate Design
Embedding the behavioral protocol and AAAK specification in the return value of tool_status(), rather than as separate tools, is an architectural decision. It leverages a behavioral characteristic of AI: the AI reads the entire content returned by a tool. By "piggybacking" the rules on the status query, it ensures the AI knows which protocol to follow before performing any operations. If the protocol and specification were split into separate tools, the AI might forget to call them.
However, this also implies a precondition: the palace must already be initialized. When _get_collection() returns None, tool_status() returns _no_palace() (mcp_server.py:64-65), which does not include the protocol and aaak_dialect fields. In other words, an AI without a palace does not receive the behavioral protocol — this serves as both a defense (preventing the protocol from acting on an empty palace) and a prompt (the AI sees the hint field and guides the user to initialize).
Search Metadata Filtering Is a Layered Architecture
The where filter construction logic in searcher.py (searcher.py:100-107) implements three levels of search granularity: full-palace search (no Wing/Room specified), Wing-level search (only Wing specified), and Room-level search (both Wing and Room specified). This corresponds to the spatial metaphor of the memory palace — you can search the entire palace, or only within a specific Room of a specific Wing.
Semantic search operates in ChromaDB's vector space; metadata filtering operates in the SQLite index. Their combination means that even if the user's query semantically matches content across multiple Wings, the Wing filter ensures only results from the relevant domain are returned.
Write Idempotency Compensates for AI Behavior
A known issue with AI is repetitive operations — across multiple conversation turns, it may forget that it has already written a particular memory and attempt to write it again. The tool_check_duplicate() call in tool_add_drawer() (mcp_server.py:259) is precisely an engineering compensation for this problem. The 0.9 similarity threshold is an empirical value: high enough to allow wording variants ("Auth0 raised prices" vs. "Auth0's pricing increased"), yet strict enough to catch substantive duplicates.
Note that deduplication and writing use the same ChromaDB collection, and deduplication itself is a vector search. This means the cost of deduplication is comparable to a regular search — at the typical scale of a personal knowledge base (thousands to tens of thousands of records), this overhead is negligible.
Transparency of the MCP Protocol Layer
Throughout the entire interaction chain, handle_request() (mcp_server.py:691-743) plays an extremely simple role: parse JSON-RPC, route to the corresponding handler, and wrap the return value as a JSON-RPC response. It performs no business logic. All tool handlers are ordinary Python functions that accept basic-type parameters and return dictionaries. This means these functions can be called directly (e.g., in tests), without depending on the MCP protocol layer.
The transparency of the MCP protocol layer is also reflected in error handling: if a handler throws an exception, handle_request() catches it and returns a JSON-RPC error object (mcp_server.py:735-737), containing error code -32000 and the exception message. Upon receiving this error, the AI can decide whether to retry, try a different query approach, or simply tell the user something went wrong.
These four phases — startup, search, graph query, write — constitute the basic cycle of MemPalace MCP interactions. Not every conversation will go through all phases (sometimes search alone is sufficient, with no need for graph queries or writes), but the Phase 1 status call is the mandatory starting point. The triple payload it loads determines all of the AI's behavioral boundaries for that session.
Appendix C: AAAK Dialect Complete Reference
This appendix consolidates the AAAK_SPEC constant from mcp_server.py and the complete encoding tables from dialect.py, providing a searchable reference for the AAAK dialect. Source baseline: the current MemPalace source snapshot discussed in this book.
Overview
AAAK is a compressed shorthand format designed for AI agents. It is not meant for human reading -- it is meant for LLMs. Any model that can read English (Claude, GPT, Gemini, Llama, Mistral) can understand AAAK directly, without a decoder or fine-tuning.
Format Structure
Line Types
| Prefix | Meaning | Format |
|---|---|---|
| 0: | Header line | FILE_NUM\|PRIMARY_ENTITY\|DATE\|TITLE |
| Z + number | Zettel entry | ZID:ENTITIES\|topic_keywords\|"key_quote"\|WEIGHT\|EMOTIONS\|FLAGS |
| T: | Tunnel (cross-entry link) | T:ZID<->ZID\|label |
| ARC: | Emotion arc | ARC:emotion->emotion->emotion |
Field Separators
- Pipe | separates different fields within the same line
- Arrow → denotes causal or transformational relationships
- Stars ★ to ★★★★★ indicate importance (1-5 scale)
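Because every AAAK line is pipe-delimited, a Zettel entry can be parsed by splitting on the pipe. The field names below are illustrative; the real dialect.py parser is richer.

```python
def parse_zettel(line):
    """Split an AAAK Zettel entry on its pipe-separated fields (sketch)."""
    fields = line.split("|")
    zid_entities, rest = fields[0], fields[1:]
    zid, _, entities = zid_entities.partition(":")
    return {"zid": zid, "entities": entities.split(","), "fields": rest}

line = 'Z1:ALC,BEN|auth_migration|"we chose Clerk"|★★★|determ|DECISION'
print(parse_zettel(line))
```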
Entity Encoding
Entity names are encoded as the first three letters in uppercase:
| Name | Code | Rule |
|---|---|---|
| Alice | ALC | name[:3].upper() |
| Jordan | JOR | |
| Riley | RIL | |
| Max | MAX | |
| Ben | BEN | |
| Priya | PRI | |
| Kai | KAI | |
| Soren | SOR | |
Source location: dialect.py:367-379 (encode_entity method)
Emotion Encoding Table
AAAK uses standardized short codes to represent emotional states.
Core Emotion Codes
| English | Code | Meaning |
|---|---|---|
| vulnerability | vul | Vulnerability |
| joy | joy | Joy |
| fear | fear | Fear |
| trust | trust | Trust |
| grief | grief | Grief |
| wonder | wonder | Wonder |
| rage | rage | Rage |
| love | love | Love |
| hope | hope | Hope |
| despair | despair | Despair |
| peace | peace | Peace |
| humor | humor | Humor |
| tenderness | tender | Tenderness |
| raw_honesty | raw | Raw honesty |
| self_doubt | doubt | Self-doubt |
| relief | relief | Relief |
| anxiety | anx | Anxiety |
| exhaustion | exhaust | Exhaustion |
| conviction | convict | Conviction |
| quiet_passion | passion | Quiet passion |
| warmth | warmth | Warmth |
| curiosity | curious | Curiosity |
| gratitude | grat | Gratitude |
| frustration | frust | Frustration |
| confusion | confuse | Confusion |
| satisfaction | satis | Satisfaction |
| excitement | excite | Excitement |
| determination | determ | Determination |
| surprise | surprise | Surprise |
Source location: dialect.py:47-88 (EMOTION_CODES dictionary)
Shorthand Markers in the MCP Server
The AAAK_SPEC in mcp_server.py uses *marker* format to annotate emotional context:
| Marker | Meaning |
|---|---|
| *warm* | Warmth / Joy |
| *fierce* | Determination / Resolve |
| *raw* | Vulnerability / Raw honesty |
| *bloom* | Tenderness / Blossoming |
Emotion Signal Detection
dialect.py automatically detects emotions in text via keyword matching:
| Keyword | Mapped Code |
|---|---|
| decided | determ |
| prefer | convict |
| worried | anx |
| excited | excite |
| frustrated | frust |
| confused | confuse |
| love | love |
| hate | rage |
| hope | hope |
| fear | fear |
| happy | joy |
| sad | grief |
| surprised | surprise |
| grateful | grat |
| curious | curious |
| anxious | anx |
| relieved | relief |
| concern | anx |
Source location: dialect.py:91-114 (_EMOTION_SIGNALS dictionary)
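The keyword-to-code mapping can be exercised with a minimal sketch. The mapping excerpt and the top-3 cap come from the tables in this appendix; the function name and exact matching behavior are illustrative assumptions, not the dialect.py API.

```python
# Excerpt of the keyword -> emotion-code mapping from the table above.
EMOTION_SIGNALS = {
    "decided": "determ", "worried": "anx", "excited": "excite",
    "happy": "joy", "sad": "grief", "grateful": "grat",
}

def detect_emotions(text: str, cap: int = 3) -> list[str]:
    """Return up to `cap` emotion codes whose trigger keywords occur in text."""
    lowered = text.lower()
    found = []
    for keyword, code in EMOTION_SIGNALS.items():
        if keyword in lowered and code not in found:
            found.append(code)
    return found[:cap]

print(detect_emotions("She decided to move and was excited but worried."))
# ['determ', 'anx', 'excite']
```

The cap illustrates the truncation issue flagged later in this appendix: a fourth triggered emotion would simply be discarded.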
Semantic Flags
Flags mark the type of factual assertion, aiding retrieval and classification.
| Flag | Meaning | Trigger Keywords |
|---|---|---|
| DECISION | Explicit decision or choice | decided, chose, switched, migrated, replaced, instead of, because |
| ORIGIN | Origin moment | founded, created, started, born, launched, first time |
| CORE | Core belief or identity pillar | core, fundamental, essential, principle, belief, always, never forget |
| PIVOT | Emotional turning point | turning point, changed everything, realized, breakthrough, epiphany |
| TECHNICAL | Technical architecture or implementation detail | api, database, architecture, deploy, infrastructure, algorithm, framework, server, config |
| SENSITIVE | Content requiring careful handling | (manually annotated) |
| GENESIS | Directly led to the creation of something that still exists | (inferred from context) |
Source location: dialect.py:117-152 (_FLAG_SIGNALS dictionary)
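Flag detection works the same way: substring triggers mapped to flag labels. The trigger subset below is excerpted from the table above; the function and its set-returning shape are illustrative assumptions, not the dialect.py implementation.

```python
# Excerpted triggers; the full _FLAG_SIGNALS table has 7 flags / 36 keywords.
FLAG_SIGNALS = {
    "DECISION": ("decided", "chose", "switched", "instead of"),
    "ORIGIN": ("founded", "created", "first time"),
    "TECHNICAL": ("api", "database", "deploy", "framework"),
}

def detect_flags(text: str) -> set[str]:
    """Return every flag with at least one trigger keyword present in text."""
    lowered = text.lower()
    return {flag for flag, keys in FLAG_SIGNALS.items()
            if any(k in lowered for k in keys)}

print(sorted(detect_flags("We chose a new database framework.")))
# ['DECISION', 'TECHNICAL']
```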
Palace Structure Identifiers
| Element | Format | Example |
|---|---|---|
| Wing | wing_ + name | wing_user, wing_code, wing_myproject |
| Hall | hall_ + type | hall_facts, hall_events, hall_discoveries, hall_preferences, hall_advice |
| Room | Hyphenated slug | chromadb-setup, gpu-pricing, auth-migration |
Full Example
Original English (~70 tokens)
Priya manages the Driftwood team: Kai (backend, 3 years), Soren (frontend),
Maya (infrastructure), and Leo (junior, started last month). They're building
a SaaS analytics platform. Current sprint: auth migration to Clerk.
Kai recommended Clerk over Auth0 based on pricing and DX.
AAAK Encoding (~35 tokens)
TEAM: PRI(lead) | KAI(backend,3yr) SOR(frontend) MAY(infra) LEO(junior,new)
PROJ: DRIFTWOOD(saas.analytics) | SPRINT: auth.migration→clerk
DECISION: KAI.rec:clerk>auth0(pricing+dx) | ★★★★
Factual Assertion Verification
| # | Assertion | AAAK Counterpart | Preserved |
|---|---|---|---|
| 1 | Priya is the team lead | PRI(lead) | Yes |
| 2 | Kai does backend | KAI(backend,3yr) | Yes |
| 3 | Kai has 3 years of experience | KAI(backend,3yr) | Yes |
| 4 | Soren does frontend | SOR(frontend) | Yes |
| 5 | Maya does infrastructure | MAY(infra) | Yes |
| 6 | Leo is a junior engineer | LEO(junior,new) | Yes |
| 7 | Leo started last month | LEO(junior,new) | Yes |
| 8 | The project is called Driftwood | DRIFTWOOD | Yes |
| 9 | It is a SaaS analytics platform | saas.analytics | Yes |
| 10 | Current sprint is auth migration | SPRINT: auth.migration→clerk | Yes |
| 11 | Migration target is Clerk | →clerk | Yes |
| 12 | Kai recommended Clerk | KAI.rec:clerk | Yes |
| 13 | Reasons are pricing and developer experience | pricing+dx | Yes |
For this short, structured example, all 13 factual assertions are preserved. The compression ratio is roughly 2x (the example is short and information-dense, so there is little to discard).
AAAK_SPEC in the MCP Server
The following is the complete specification passed to the AI via the mempalace_status tool, found at mcp_server.py:102-119:
AAAK is a compressed memory dialect that MemPalace uses for efficient storage.
It is designed to be readable by both humans and LLMs without decoding.
FORMAT:
ENTITIES: 3-letter uppercase codes. ALC=Alice, JOR=Jordan, RIL=Riley, MAX=Max, BEN=Ben.
EMOTIONS: *action markers* before/during text. *warm*=joy, *fierce*=determined,
*raw*=vulnerable, *bloom*=tenderness.
STRUCTURE: Pipe-separated fields. FAM: family | PROJ: projects | ⚠: warnings/reminders.
DATES: ISO format (2026-03-31). COUNTS: Nx = N mentions (e.g., 570x).
IMPORTANCE: ★ to ★★★★★ (1-5 scale).
HALLS: hall_facts, hall_events, hall_discoveries, hall_preferences, hall_advice.
WINGS: wing_user, wing_agent, wing_team, wing_code, wing_myproject,
wing_hardware, wing_ue5, wing_ai_research.
ROOMS: Hyphenated slugs representing named ideas (e.g., chromadb-setup, gpu-pricing).
EXAMPLE:
FAM: ALC→♡JOR | 2D(kids): RIL(18,sports) MAX(11,chess+swimming) | BEN(contributor)
Read AAAK naturally — expand codes mentally, treat *markers* as emotional context.
When WRITING AAAK: use entity codes, mark emotions, keep structure tight.
Under the current protocol, the AI receives this specification when it explicitly calls mempalace_status and a palace already exists. This is not an out-of-band automatic injection path.
Compression Pipeline
The compress() method in dialect.py performs five-stage processing:
graph TD
A[Raw Text] --> B["1. Entity Detection<br/>name[:3].upper()"]
B --> C["2. Topic Extraction<br/>Remove stopwords + frequency sort"]
C --> D["3. Key Sentence Selection<br/>_extract_key_sentence()"]
D --> E["4. Emotion/Flag Detection<br/>Keyword → code mapping"]
E --> F["5. AAAK Assembly<br/>Pipe-separated + header line"]
For the current dialect.compress() plain-text path, a more accurate description is that all five stages contain heuristic selection, not only Stage 3. Entities, topics, emotions, and flags are all detected and truncated; key_sentence is simply the most obvious selection step. The current pipeline is closer to a high-compression index generator than to a strictly lossless encoder.
The "lossless AAAK" discussed elsewhere in the README and book is best understood as a design goal: truly preserving fact-by-fact structure would require a stronger alignment between the compressor and the original text than the current heuristic plain-text pipeline provides.
Source location: dialect.py:539-602 (compress method)
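Under the caveats above, the five stages can be approximated with a toy end-to-end sketch. Every helper and threshold here (the regexes, the top-3 cap, the 55-char quote limit, the tiny stop-word set) is an illustrative stand-in for the heuristics in compress(), chosen to make the lossy selection at each stage visible; this is not the dialect.py implementation.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "to", "of", "is", "on", "in", "for"}

def compress(text: str, top_k: int = 3, max_quote: int = 55) -> str:
    # 1. Entity detection: capitalized words -> 3-letter codes, top-k kept
    entities = [w[:3].upper() for w in re.findall(r"\b[A-Z][a-z]+\b", text)][:top_k]
    # 2. Topic extraction: stop words removed, frequency-sorted, top-k kept
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    topics = [w for w, _ in Counter(words).most_common(top_k)]
    # 3. Key sentence selection: here simply the longest sentence, truncated
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    quote = max(sentences, key=len)[:max_quote]
    # 4. Emotion/flag detection omitted for brevity (same keyword matching)
    # 5. Assembly: pipe-separated AAAK-style line
    return f'{",".join(entities)}|{"_".join(topics)}|"{quote}"'

out = compress("Priya manages the Driftwood team. Kai recommended Clerk over Auth0.")
print(out)
```

Every stage throws information away (uncaptured entities, low-frequency topics, unselected sentences), which is exactly why the output reads as an index rather than a lossless encoding.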
AAAK Dialect Completeness Assessment
Implemented Capabilities
| Capability | Source Location | Completeness |
|---|---|---|
| Entity encoding | encode_entity() :367-379 | Complete — name[:3].upper(), supports pre-defined mappings and auto-coding |
| Emotion encoding | EMOTION_CODES :47-88 | Complete — 28 emotions → short code mapping |
| Emotion detection | _EMOTION_SIGNALS :91-114 | Basic — 24 keyword triggers, simple substring matching, no context awareness |
| Flag detection | _FLAG_SIGNALS :117-152 | Basic — 7 flag types, 36 keywords, simple matching |
| Topic extraction | _extract_topics() :430-455 | Basic — word frequency + capitalization/camelCase weighting, top-3 |
| Key sentence | _extract_key_sentence() :457-508 | Basic — 18 decision words scored, short sentences weighted, truncated to 55 chars |
| Entity detection | _detect_entities_in_text() :510-537 | Basic — known entity matching + capitalized word fallback, top-3 |
| Compression assembly | compress() :539-602 | Complete — pipe-separated output format |
| Stop words | _STOP_WORDS :155-289 | Complete — ~135 English stop words |
| Config persistence | from_config() / save_config() | Complete |
| Zettel format | encode_zettel() / compress_file() | Complete — backward-compatible with legacy format |
| Layer1 generation | generate_layer1() | Complete — batch compression + aggregation |
| Compression stats | compression_stats() | Complete — original/compressed token counting |
Missing Critical Capabilities
As a "language," AAAK lacks key linguistic infrastructure:
| Missing | Impact | Severity |
|---|---|---|
| No formal grammar definition | No BNF/EBNF/PEG specification; "grammar" exists only in compress() code logic | High |
| No decoder/decompressor | Only the encoding direction; no decompress() method to verify reversibility | High |
| No roundtrip tests | No assert decompress(compress(text)) ≈ text verification | High |
| No token-level precision | count_tokens() uses len(text)//3 estimation, not a real tokenizer | Medium |
| No multilingual support | Stop words, signal words, and entity detection are all hardcoded for English | Medium |
| No versioning | The encoding format has no version marker; different AAAK output versions cannot be distinguished | Medium |
| Truncation is irrecoverable | key_sentence truncated to 55 chars (:506-507), topics capped at top-3, emotions capped at top-3 — anything beyond is discarded | High |
Core Qualitative Judgment
AAAK is not a language — it is a compression function.
A true language requires three elements:
- Syntax — what constitutes a valid AAAK string. AAAK partially has this (pipe separation, header line format), but without formal definition.
- Semantics — the meaning definition of each symbol. AAAK has this (emotion code table has clear semantics).
- Roundtrip capability — information is preserved after encode→decode. AAAK completely lacks this.
compress() is a one-way function — it compresses text into AAAK format, but there is no corresponding decompress() to verify whether information was actually preserved. The README's "lossless" claim relies on "LLMs can read AAAK" — this delegates verification responsibility to the model's reasoning capability rather than the format's own reversibility guarantee.
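For contrast, this is what a roundtrip property test would look like if a decoder existed. The compress/decompress pair below is a trivially reversible stand-in, not AAAK; it only demonstrates the invariant, decompress(compress(text)) == text, that a one-way pipeline can never be checked against.

```python
def compress(text: str) -> str:
    """Reversible stand-in codec: replace spaces with pipes."""
    return text.replace(" ", "|")

def decompress(aaak: str) -> str:
    """Exact inverse of the stand-in compress()."""
    return aaak.replace("|", " ")

# The roundtrip property a lossless codec must satisfy on every input:
for sample in ["Kai recommended Clerk", "Priya leads Driftwood"]:
    assert decompress(compress(sample)) == sample
print("roundtrip OK")
```

Without a decompress(), this assertion cannot even be written for AAAK, which is why the "lossless" claim can only be checked indirectly, via an LLM's reading of the output.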
In Fairness
- The design intuition is correct — "extremely abbreviated English, let the LLM be the decoder" genuinely works, because LLM language understanding can fill in omitted information.
- Engineering-sufficient — as a Closet-layer index (not the sole storage), AAAK does not need strict losslessness — Drawers preserve the originals.
- Cross-model readability is real — any English-capable model can indeed understand KAI(backend,3yr); this property does not depend on AAAK's formal completeness.
- 950 lines of code achieved usability — for a v3.0.0 project, this implementation sufficiently supports the benchmark results.
Overall Ratings
| Dimension | Score | Notes |
|---|---|---|
| Design concept | 8/10 | "LLM as decoder" is an original and effective insight |
| Implementation completeness | 5/10 | Encoder is complete, but lacks decoder and roundtrip verification |
| Formal language completeness | 3/10 | No BNF, no versioning, no formal semantics |
| Engineering utility | 7/10 | Sufficient as an index layer with Drawer as safety net |
| "30x lossless" claim | 3/10 | Over-promises — actually lossy index generation |
The most honest positioning: AAAK is an AI-oriented shorthand index format that enables any LLM to quickly understand context summaries through extremely abbreviated English, while relying on the Drawer layer to preserve complete original text as a safety net. Its core value lies not in "lossless compression" but in "cross-model-readable efficient indexing."
Appendix D: Authenticity and Credibility Assessment
Positioning: This appendix is not a technical design walkthrough. It is an evidence-based assessment of the current open-source MemPalace repository. The question it answers is not "is this a good idea," but "which capabilities are clearly supported by the code, which ones still live mainly in README / narrative form, and which claims cannot be confirmed from the local repository alone."
Scope of Assessment
This assessment is based on two things only:
- The local source snapshot referenced by this book
- Technical claims in the book that can be checked directly against that code
It does not attempt to judge the founders' motives, and it does not evaluate any closed-source components, private datasets, offline demos, or social-media presentation. In other words, this is not a moral judgment and not investment advice. It is an engineering credibility audit.
That boundary matters. A project can be two things at once:
- It can contain real, working engineering
- Its narrative can still run ahead of the current implementation
MemPalace is exactly that kind of project.
Three-Column Conclusion
| Category | Conclusion | Credibility |
|---|---|---|
| Core local ingest / store / retrieve pipeline | Real and traceable directly in the code | High |
| AAAK as a "strictly lossless, universal, ultra-high-compression" current implementation | Better understood as a design goal than as the exact state of the current plain-text compressor | Low to medium |
| ~170 token wake-up, Hall/Closet/agent automation narrative | README / roadmap content weighs more heavily than the default runtime path | Low |
| Benchmark scripts and result-reproduction pipeline | Real, but must be separated into raw / hybrid / rerank instead of treated as the default product path | Medium |
| "Completely offline, unplug the network immediately after install" | Core paths are largely local-first, but cold-start and asset-preparation boundaries still exist | Medium |
The table can be reduced to one sentence:
This is not a code-free shell project, but its promotional narrative has often run ahead of the implementation.
I. What Is Real
1. Ingestion and normalization are real
The normalize.py, miner.py, and convo_miner.py path is not decorative. The project really can convert multiple input formats into a common transcript / drawer representation and write them into a local vector store. This is not a repository that contains only benchmarks and no product code.
That means MemPalace has at least one solid chassis: local ingest -> chunk -> store -> search is real.
2. Retrieval and the MCP interface are real
searcher.py provides working semantic retrieval. mcp_server.py really exposes a read/write tool surface. Even if you reject the grand "memory palace" framing entirely, the project still remains a real local memory-storage + search + MCP-wrapper system.
3. Some memory-layer and auxiliary capabilities are real
layers.py, the knowledge graph, diary tools, duplicate checking, and taxonomy-related pieces are not PPT labels. They exist in the repository, they expose callable interfaces, and parts of them can be verified directly.
But "code exists" is not the same thing as "the narrative version is fully true." That leads to the next section.
II. Where the Narrative Clearly Runs Ahead
1. AAAK is described more strongly than the code supports
The book and README can easily leave the impression that AAAK is already a currently usable, fact-by-fact lossless compression language. But the current dialect.compress() plain-text path contains substantial heuristic selection:
- Only a few entities are kept
- Topics are top-k frequency outputs
- Emotions and flags are truncated
- key_sentence is itself an explicit selection step
That makes it much closer to a high-compression index generator than to a strict zero-loss encoder. So if AAAK is read as a design direction, I think it is credible. If it is read as a fully delivered current product capability in the open-source code, I do not.
2. ~170 token wake-up is not the default runtime path today
The current runtime wake-up path is still a longer multi-layer text assembly, not the ~170 token AAAK wake-up that appeared repeatedly in earlier README/book wording. That smaller number is closer to a target state described in the README than to the current CLI's default output.
This difference matters in practice because it changes how users reason about cost, latency, and local-model usability.
3. Hall / Closet / agent architecture is narrated more completely than implemented
In the story, MemPalace is often described as a richly layered, automatically routed cognitive architecture with specialist agents. In the current open-source implementation, the stable primary path is much closer to:
- wing
- room
- drawers
- optional metadata and auxiliary tools
Hall, Closet, automatic routing, and built-in reviewer/architect/ops agent workflows often read more like design vocabulary, interface vision, or README worldview than like step-for-step default runtime reality.
III. What Can Only Be Rated Medium Confidence
1. Benchmark results are not the same thing as default product behavior
The repository really does contain benchmark scripts, and it really does contain raw / hybrid / rerank paths. The issue is that readers can easily misread "100% on a benchmark" as "the default product path is already 100%." That is not accurate.
A more rigorous reading is:
- raw retrieval is a real product capability
- hybrid / rerank is a stronger but more complex evaluation or experimental path
- the benchmark ceiling is not identical to the default product path
So the benchmark is not fake, but it is easy to overread.
2. Local-first is broadly credible, but absolute wording needs caution
One of MemPalace's core value claims is local-first. Based on the current repository, that direction is broadly credible: the main storage, retrieval, normalization, and chunking paths run locally and do not depend on SaaS APIs.
But if you state it as "finish installation, unplug immediately, and every scenario works with no caveats," that goes too far. Default embedding assets, optional benchmark rerank, and Wikipedia lookup-style boundaries still exist. The more accurate version is:
It is a local-first system, not an absolutely proven offline-for-every-cold-start scenario system.
IV. Overall Judgment
If the question is: "Is it a pure scam?"
My answer is: It does not look like one.
Pure scam projects usually have three traits:
- no real runnable code
- key capabilities exist only in videos or marketing language
- once you follow the call chain, the core logic turns out to be empty
MemPalace does not fit that pattern. It has real ingestion code, real local storage, real search, real MCP tooling, and some genuinely useful engineering ideas.
But if the question is: "Is anything clearly overstated?"
My answer is also: Yes, and in a systematic way.
The overstatement does not mainly take the form of fake code. It mainly comes from blending together three different layers:
- the current default implementation
- benchmark / experimental paths
- README and design-vision narrative
Once those three layers are mixed, readers will overestimate project maturity.
So the fairest conclusion is neither "scam" nor "masterpiece," but:
A project with real engineering substance, but a persistent tendency for the narrative to outrun the implementation.
V. How to Read This Project
If you plan to keep reading this book or evaluating MemPalace, the safest order is:
- Trust the primary code path first: normalize -> chunk -> store -> search -> MCP
- Then read the benchmarks, separating raw, hybrid, and rerank
- Only then read the README / AAAK / Hall / Closet / agent narrative, defaulting to "direction" rather than "current state"
If you reverse that order, it is very easy to be pulled in by the worldview first and then keep discovering implementation gaps afterward.
The point of this appendix is not to "sentence the project." It is to give readers a steadier ruler: confirm what is already true, then discuss where it should go next.