Jun 1, 2026 · Architecture · Memory · 7 min read

Memory architecture for AI agents

How we built a tenant-scoped, append-only memory layer with hierarchical compression and conflict audit.

AI agents forget across sessions. That is the structural defect every operator runs into the moment a chat outlives a single context window. The model handles the next turn brilliantly and then loses everything once the conversation rolls out of that window. A customer who told us their delivery address last week tells us again this week. A buyer who already objected to a price gets the same pitch on the second touch. A repeat client gets greeted as a stranger.

The reflex is to feed more context. Stuff the system prompt with a profile blob, paste recent messages, dump CRM notes. That works at small scale and collapses at production scale. Tokens explode, latency degrades, and the model starts confidently asserting facts that were true three months ago and have since changed. Worse, when the model contradicts the database, nothing tells it the database is right. Memory drifts from helpful context into a parallel source of truth, and a parallel source of truth in an autonomous system is a liability.

We rebuilt the memory layer from the substrate up, with one design rule guiding every choice. Memory helps context. It never determines facts.

The substrate: episodes and pages

The memory layer rests on two tables. The first is an append-only event log called memory episodes. Every interaction a client has with the operating layer, whether it is a customer DM, a payment proof upload, a manager chat turn, or a decision cycle outcome, writes one or more episode rows. Each row records what kind of event happened, the actor, the salient content, the tenant boundary, the timestamp, and the provenance trail. Episodes are never mutated. Corrections write new episodes that supersede earlier ones, preserving the audit chain.
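The append-only rule with superseding corrections can be sketched in a few lines. This is a minimal in-memory illustration, not the production schema: the class names, field names, and the `supersedes` pointer are assumptions made for the example.

```python
from __future__ import annotations

import itertools
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)  # frozen: a written episode cannot be mutated in place
class Episode:
    episode_id: int
    tenant_id: str
    kind: str
    actor: str
    content: str
    source: str                       # provenance, e.g. the originating channel
    created_at: datetime
    supersedes: Optional[int] = None  # id of the episode this one corrects


class EpisodeLog:
    """In-memory stand-in for an append-only episodes table (hypothetical)."""

    def __init__(self) -> None:
        self._rows: list[Episode] = []
        self._ids = itertools.count(1)

    def append(self, tenant_id: str, kind: str, actor: str, content: str,
               source: str, supersedes: Optional[int] = None) -> Episode:
        # A correction may only point at an existing episode in the same tenant.
        if supersedes is not None and not any(
            r.episode_id == supersedes and r.tenant_id == tenant_id
            for r in self._rows
        ):
            raise ValueError("can only supersede an episode in the same tenant")
        row = Episode(next(self._ids), tenant_id, kind, actor, content, source,
                      datetime.now(timezone.utc), supersedes)
        self._rows.append(row)  # the only write path: no update, no delete
        return row

    def current(self, tenant_id: str) -> list[Episode]:
        """Episodes for one tenant that no later episode has superseded."""
        rows = [r for r in self._rows if r.tenant_id == tenant_id]
        superseded = {r.supersedes for r in rows if r.supersedes is not None}
        return [r for r in rows if r.episode_id not in superseded]

    def history(self, tenant_id: str) -> list[Episode]:
        """Full audit chain for one tenant, including superseded rows."""
        return [r for r in self._rows if r.tenant_id == tenant_id]
```

A corrected delivery address would be written as a second episode with `supersedes` set to the first. `current()` then returns only the correction, while `history()` still shows both rows.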

The second table is the working set, called customer memory pages. Pages are the compressed view the model actually reads. Where episodes are raw history, pages are summarized, indexed, and tiered. A hot page covers the last few interactions and high-salience facts the model needs on every turn. A warm page covers recent context the model can pull on demand. A cold page holds aged or low-signal history retained for audit but not routinely surfaced. Pages compress as they age and rotate through the tiers based on recency and outcome relevance, so the working set the model sees at any moment fits a budget rather than ballooning with history.

Compression is hierarchical. Hot pages may carry near-verbatim message snippets. Warm pages collapse multi-turn exchanges into a sentence or two. Cold pages retain only the outcome signal and a pointer back to the originating episodes. The model never sees the full episode log unless an investigation explicitly retrieves it. Inference reads pages. Audit reads episodes.
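To make the tiering concrete, here is a rough sketch of how a page might be assigned a tier, rendered at that tier's level of detail, and packed into a budget. The age thresholds, the character budget, and every name below are assumptions for illustration. The essay does not specify the real cut-offs, and it also factors in outcome relevance, which this sketch reduces to a single `high_salience` flag.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta

HOT_MAX_AGE = timedelta(days=7)    # assumed threshold
WARM_MAX_AGE = timedelta(days=60)  # assumed threshold


@dataclass
class Page:
    episode_ids: list[int]  # pointer back to the originating episodes
    verbatim: str           # near-verbatim snippet, used while hot
    summary: str            # one or two sentences, used while warm
    outcome: str            # outcome signal only, used when cold
    last_seen: datetime
    high_salience: bool = False


def tier_for(page: Page, now: datetime) -> str:
    """Assign a tier from recency, keeping high-salience facts hot."""
    age = now - page.last_seen
    if page.high_salience or age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"


def render(page: Page, now: datetime) -> str:
    """Return the page text at the compression level of its tier."""
    tier = tier_for(page, now)
    if tier == "hot":
        return page.verbatim
    if tier == "warm":
        return page.summary
    return f"{page.outcome} (episodes: {page.episode_ids})"


def working_set(pages: list[Page], now: datetime, budget_chars: int) -> list[str]:
    """Pack hot pages first, then warm, newest first; cold pages are skipped."""
    order = {"hot": 0, "warm": 1}
    eligible = [p for p in pages if tier_for(p, now) in order]
    eligible.sort(key=lambda p: (order[tier_for(p, now)], now - p.last_seen))
    chosen: list[str] = []
    used = 0
    for page in eligible:
        text = render(page, now)
        if used + len(text) > budget_chars:
            continue  # does not fit; a smaller page further down may still fit
        chosen.append(text)
        used += len(text)
    return chosen
```

The budget here is counted in characters only to keep the example dependency-free. A token count would be the more natural unit in practice.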

Hard rules: where memory is allowed, where it stops

Memory layers are dangerous when they overreach. We codified the boundaries in code, not in prompt instructions.

Provenance is required on every memory write. A page row cannot be created without a source: which episode generated it, which actor, which channel, which correlation id. A page with no provenance does not enter the working set. This makes every memory-influenced model decision traceable back to a real event.
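One way to enforce this in code rather than in a prompt is to make provenance a required, validated value on the write path. The sketch below assumes hypothetical `Provenance` and `PageRow` types and a list standing in for the working set. It illustrates the rule and does not reproduce the actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


class MissingProvenance(ValueError):
    """Raised when a page write lacks a complete source trail."""


@dataclass(frozen=True)
class Provenance:
    episode_id: int      # which episode generated the page
    actor: str           # who acted
    channel: str         # where it happened
    correlation_id: str  # ties the write to a request or decision cycle

    def __post_init__(self) -> None:
        # Reject blank strings and non-positive ids at construction time.
        missing = [
            name for name in ("actor", "channel", "correlation_id")
            if not getattr(self, name).strip()
        ]
        if self.episode_id <= 0:
            missing.insert(0, "episode_id")
        if missing:
            raise MissingProvenance(f"missing provenance fields: {', '.join(missing)}")


@dataclass(frozen=True)
class PageRow:
    tenant_id: str
    text: str
    provenance: Provenance


def write_page(working_set: list[PageRow], tenant_id: str, text: str,
               provenance: Optional[Provenance]) -> PageRow:
    """Append a page to the working set only if it carries provenance."""
    if provenance is None:
        raise MissingProvenance("page write rejected: no provenance supplied")
    row = PageRow(tenant_id, text, provenance)
    working_set.append(row)
    return row
```

Because the check lives in the constructor and the write function, a caller cannot produce an unsourced page by accident. The failure is an exception at write time, not a silent gap discovered during an audit.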

Memory helps context, never determines facts. Prices, stock counts, payment confirmations, ledger entries, lead times, and authority thresholds are read from the database of record on every call. If memory contains a stale claim about a price, the price lookup hits the products table and overrides the memory. The same applies to stock, invoices, payments, and any other operational truth. The model can recall that a customer asked about a product. It cannot recall what the product currently costs.

Database wins on conflict. If retrieved memory contradicts the canonical source, the canonical source wins and the delta logs to a review queue. This is the opposite of how a naive RAG system behaves, where the retrieved snippet often outranks the live lookup because it landed in the prompt last. In Laras, the retrieval is treated as a hint to the model, not as evidence. The fact lookup is treated as ground truth.
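The conflict rule is small enough to sketch directly. In the example below the canonical lookup always runs, its result is always what gets returned, and a remembered value is only ever compared against it. The function name, the `ConflictDelta` record, and the list used as a review queue are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Optional


@dataclass(frozen=True)
class ConflictDelta:
    field: str           # e.g. "price"
    key: str             # e.g. a product id
    remembered: Decimal  # what the memory page claimed
    canonical: Decimal   # what the database of record says


def resolve_fact(field: str, key: str, remembered: Optional[Decimal],
                 lookup: Callable[[str], Decimal],
                 review_queue: list[ConflictDelta]) -> Decimal:
    """Return the canonical value; log a delta if memory disagreed."""
    canonical = lookup(key)  # always read the database of record
    if remembered is not None and remembered != canonical:
        review_queue.append(ConflictDelta(field, key, remembered, canonical))
    return canonical  # memory never changes the answer


# Usage: a dict stands in for the products table.
products = {"sku-1": Decimal("120.00")}
queue: list[ConflictDelta] = []
price = resolve_fact("price", "sku-1", Decimal("99.00"), products.__getitem__, queue)
```

After this call `price` holds the table's value and `queue` holds one delta recording the stale claim. Note that the remembered value has no code path by which it could be returned.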

Retrieval is confidence-gated. Pages carry a confidence signal derived from recency, episode density, and outcome alignment. Low-confidence pages do not get injected into the working set. A single mention from three months ago does not become a permanent assertion about the customer's preferences.
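As a rough illustration of such a gate, the sketch below blends the three named signals into a score between 0 and 1 and drops pages below a threshold. The weights, the 30-day half-life, the density cap, and the 0.5 threshold are all assumptions chosen for the example; the essay does not state the actual formula.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PageSignals:
    text: str
    age_days: float           # recency: days since the last supporting episode
    episode_count: int        # density: how many episodes back the page
    outcome_alignment: float  # 0..1: how well the page matched later outcomes


def confidence(s: PageSignals, half_life_days: float = 30.0,
               density_cap: int = 5) -> float:
    """Blend recency, density, and outcome alignment into a 0..1 score."""
    recency = 0.5 ** (s.age_days / half_life_days)  # halves every half-life
    density = min(s.episode_count, density_cap) / density_cap
    alignment = min(max(s.outcome_alignment, 0.0), 1.0)
    return 0.4 * recency + 0.3 * density + 0.3 * alignment  # assumed weights


def gate(pages: list[PageSignals], threshold: float = 0.5) -> list[PageSignals]:
    """Keep only pages confident enough to be injected into the working set."""
    return [p for p in pages if confidence(p) >= threshold]
```

With these assumed numbers, a single mention from 90 days ago with neutral outcome alignment scores well under the threshold, while a preference repeated across five episodes in the last few days clears it easily.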

Kind-based exclusion, language-neutral by design

A persistent risk in memory systems is that the model summarizes a price, a stock figure, or a payment status into a page, and that summary then gets read back as if it were authoritative. The naive fix is regex over the language: scrub anything that looks like a currency amount, anything that looks like a payment confirmation, anything that looks like an authority claim. That fix breaks the moment a customer writes in a language the regex was not designed for. Arabic, Catalan, Javanese, Portuguese, Chinese, Spanish, Thai, and Vietnamese all surface in real customer chats. Language-specific regex is a non-starter.

We use kind-based exclusion instead. Every episode carries a kind classification declared at write time by the skill that produced it. Skills that touch money declare their episodes as financial. Skills that touch stock declare their episodes as inventory. Skills that touch payment proofs declare their episodes as payment. The memory compactor reads the kind, not the surface text. Excluded kinds never reach the page tier. The model never sees a memory page asserting a price, because the financial kind exclusion blocks the write before the compactor ever runs.

This shifts the policing from natural language pattern matching to structural classification. The LLM judges intent on read. The code enforces boundaries on write. Adding a new language to the platform does not require updating any memory rules.
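A minimal sketch of kind-based exclusion follows. The filter looks only at the declared kind and never inspects the content, so the same rule holds regardless of the language the customer wrote in. The `financial`, `inventory`, and `payment` kinds come from the description above. The other kinds, the type names, and the choice to filter at the compactor's input are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CONVERSATION = "conversation"  # assumed non-excluded kind
    PREFERENCE = "preference"      # assumed non-excluded kind
    FINANCIAL = "financial"
    INVENTORY = "inventory"
    PAYMENT = "payment"


# Kinds that must never be summarized into a memory page.
EXCLUDED_KINDS = frozenset({Kind.FINANCIAL, Kind.INVENTORY, Kind.PAYMENT})


@dataclass(frozen=True)
class Episode:
    kind: Kind    # declared at write time by the skill that produced it
    content: str  # never inspected by the filter below


def compactable(episodes: list[Episode]) -> list[Episode]:
    """Return only the episodes the compactor may turn into pages."""
    return [e for e in episodes if e.kind not in EXCLUDED_KINDS]


# Usage: the content language is irrelevant to the decision.
episodes = [
    Episode(Kind.FINANCIAL, "Harganya Rp 150.000"),
    Episode(Kind.PREFERENCE, "ชอบสีแดง"),
    Episode(Kind.PAYMENT, "Pago confirmado"),
]
kept = compactable(episodes)
```

Only the preference episode survives, and no pattern had to recognize a currency format or a payment phrase to get there.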

Provider abstraction: swappable backends

The memory layer ships with one backend today and is designed to accept others. The memory module exposes a provider interface with the standard operations: record an episode, compact into pages, retrieve pages for a query, audit the chain back to source.
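The four operations map naturally onto a small interface. The sketch below expresses it as a Python `Protocol` with a toy in-memory implementation. Method names, signatures, and the dictionary-based rows are assumptions made for the example, not the actual provider contract.

```python
from __future__ import annotations

from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class MemoryProvider(Protocol):
    """Hypothetical provider seam: every backend offers these four operations."""

    def record_episode(self, tenant_id: str, episode: dict) -> int: ...
    def compact(self, tenant_id: str) -> int: ...
    def retrieve_pages(self, tenant_id: str, query: str, limit: int = 5) -> Sequence[dict]: ...
    def audit(self, tenant_id: str, page_id: int) -> Sequence[dict]: ...


class InMemoryProvider:
    """Toy backend; every structure is keyed by tenant so reads never cross."""

    def __init__(self) -> None:
        self._episodes: dict[str, list[dict]] = {}
        self._pages: dict[str, list[dict]] = {}
        self._cursor: dict[str, int] = {}  # how many episodes are compacted

    def record_episode(self, tenant_id: str, episode: dict) -> int:
        rows = self._episodes.setdefault(tenant_id, [])
        rows.append({**episode, "id": len(rows) + 1})  # append only
        return len(rows)

    def compact(self, tenant_id: str) -> int:
        """Fold episodes not yet compacted into one new page; return pages made."""
        rows = self._episodes.get(tenant_id, [])
        pending = rows[self._cursor.get(tenant_id, 0):]
        if not pending:
            return 0
        pages = self._pages.setdefault(tenant_id, [])
        pages.append({
            "id": len(pages) + 1,
            "text": " | ".join(e["content"] for e in pending),
            "episode_ids": [e["id"] for e in pending],  # provenance pointer
        })
        self._cursor[tenant_id] = len(rows)
        return 1

    def retrieve_pages(self, tenant_id: str, query: str, limit: int = 5) -> Sequence[dict]:
        needle = query.lower()
        hits = [p for p in self._pages.get(tenant_id, []) if needle in p["text"].lower()]
        return hits[:limit]

    def audit(self, tenant_id: str, page_id: int) -> Sequence[dict]:
        """Walk a page back to the episodes that produced it."""
        for page in self._pages.get(tenant_id, []):
            if page["id"] == page_id:
                wanted = set(page["episode_ids"])
                return [e for e in self._episodes[tenant_id] if e["id"] in wanted]
        raise KeyError(f"no page {page_id} for tenant {tenant_id}")
```

Callers depend only on `MemoryProvider`, so a different backend can be substituted without touching them. The substring match in `retrieve_pages` is a placeholder for whatever retrieval strategy a real backend uses.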

The default provider runs on an internal relational store. Episodes and pages live in the same database that holds the rest of the operating ledger, which keeps tenant isolation tight, keeps audit queries cheap, and avoids a new operational dependency. That provider is the production baseline.

The abstraction was built knowing that purpose-built memory engines are emerging fast. Each takes a different position on episodic memory, semantic graphs, and consolidation strategies. Rather than commit to one and rebuild later, we made the provider the seam. New backends slot in behind the same interface. The shadow pattern lets us run a new provider in parallel for selected surfaces, compare outputs, and promote it once it earns the trust to write production pages.

The first shadow target is the lowest-blast-radius surface with the clearest evaluation signal. Customer service comes next, guarded by tighter exclusion kinds and a higher confidence threshold. Sales multi-touch follows once the consolidation strategy across cross-session context is validated. Money, ERP, catalog authority, and pattern aggregation never get a third-party memory provider. Those stay on the internal store with deterministic governance.
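The shadow pattern and the surface restrictions above can be sketched as a thin wrapper around retrieval. The primary provider always serves the result. The shadow runs only on allowed surfaces, and its output is recorded for comparison and never returned. Surface identifiers, function names, and the agreement metric are assumptions for illustration; the essay does not say what promotion criteria are actually used.

```python
from __future__ import annotations

from typing import Callable, Sequence

# A retriever takes (tenant_id, query) and returns page texts.
Retriever = Callable[[str, str], Sequence[str]]

# Surfaces that never get a third-party provider (identifiers are hypothetical).
INTERNAL_ONLY = frozenset({"money", "erp", "catalog_authority", "pattern_aggregation"})


def retrieve_with_shadow(surface: str, tenant_id: str, query: str,
                         primary: Retriever, shadow: Retriever,
                         comparisons: list[dict],
                         shadow_surfaces: set[str]) -> list[str]:
    """Serve from the primary; run the shadow on the side where allowed."""
    served = list(primary(tenant_id, query))
    if surface in INTERNAL_ONLY or surface not in shadow_surfaces:
        return served  # shadow never runs on this surface
    try:
        candidate = list(shadow(tenant_id, query))
        comparisons.append({"surface": surface, "match": candidate == served, "error": None})
    except Exception as exc:  # a failing shadow must never affect production
        comparisons.append({"surface": surface, "match": False, "error": repr(exc)})
    return served


def agreement_rate(comparisons: list[dict]) -> float:
    """Share of shadow runs whose output matched the primary."""
    if not comparisons:
        return 0.0
    return sum(1 for c in comparisons if c["match"]) / len(comparisons)
```

Exact equality is the crudest possible comparison. A real evaluation would more likely score overlap or downstream outcome quality, but the structure stays the same: serve from the primary, measure the candidate, and promote it only once the evidence supports doing so.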

What this buys

A grounded memory layer lets the operating brain hold context across sessions without inheriting the failure modes that come with letting the model believe its own recollection. The customer who told us their delivery address last week is not asked again. The buyer who already objected to a price is met with a different angle, not the same pitch. The repeat client is greeted as a repeat client, not as a stranger.

And when the model is wrong, the system catches it. Memory surfaces a hint. The fact lookup wins. The delta routes to review. The episode log preserves the trail. The next compaction cycle adjusts the page tier so the stale hint stops being injected. The loop closes.

Memory is a context organ, not a truth organ. The architecture above is what that distinction looks like in production code.