
Memory States in AI Agents: From Context Engineering to Enterprise Customer Intelligence

A structured tour of recent methods, from vector stores to temporal knowledge graphs, and how these architectures translate into real-world customer context systems inside large organizations.


Large language models are stateless by design. Every conversation begins fresh, with no recollection of what came before. For simple Q&A this is perfectly fine. But as agents take on complex, multi-session workflows (debugging a codebase over days, managing a customer relationship over months, conducting long-horizon research), statefulness stops being a luxury and becomes a hard requirement.

The question researchers and engineers are racing to answer is not whether agents should have memory, but how that memory should be structured, stored, retrieved, and evolved. This post surveys the most significant methods in use today, with an eye on what makes each approach distinct.

Memory transforms a stateless, reactive model into a stateful, adaptive entity: one capable of relationship building, trajectory-based learning, and genuine personalization.

Why the Context Window Is Not Enough

The context window is an agent's working memory: everything immediately available for reasoning, answering, and acting. Designing an agent's memory is, fundamentally, context engineering: deciding which tokens enter that window, and how they're organized.

The problem is that context windows are bounded. Even with 128K or 200K token windows, long-running agents accumulate histories far too large to re-inject wholesale. The signal gets diluted. Costs compound. And the agent inevitably loses track of things it learned earlier.

This is why a new class of external memory architectures has emerged: systems that offload, compress, and selectively retrieve context from outside the model. Seven methods now dominate the landscape:

Method 1: Vector Store Memory (RAG-based)

The earliest and most widely deployed approach. Interactions are embedded using a sentence encoder and stored in a vector database such as Pinecone, Weaviate, Qdrant, or FAISS. On each new query, the agent retrieves the top-k most semantically similar past fragments by cosine similarity and injects them into the context window.

# Canonical retrieval pattern
query_embedding = embed(user_query)
top_k_memories = vector_db.search(query_embedding, k=5)
context = format_for_prompt(top_k_memories) + user_query
  

This is fast, simple, and model-agnostic. Its weakness is surface-level recall: it finds what sounds similar, not what is causally or relationally connected. Two memories that are semantically distant may still be deeply relevant to each other; cosine similarity alone cannot surface that link.

Traditional RAG is insufficient for agent memory. It provides static document retrieval, but agents need dynamic knowledge that evolves with each interaction.
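The retrieval pattern above can be made concrete with a minimal, dependency-free sketch. The three-dimensional "embeddings" here are toy stand-ins for a real sentence encoder, and the memory labels are invented for illustration:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_k(query_vec, memories, k=2):
    """Rank stored (id, vector) pairs by cosine similarity to the query."""
    scored = sorted(memories, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [mid for mid, _ in scored[:k]]

# Toy memory store: ids paired with fake embedding vectors
memories = [
    ("billing_complaint", [1.0, 0.0, 0.0]),
    ("invoice_question",  [0.9, 0.1, 0.0]),
    ("feature_request",   [0.0, 1.0, 0.0]),
]

# A query "about billing" lands near the first two vectors
print(top_k([1.0, 0.05, 0.0], memories))  # ['billing_complaint', 'invoice_question']
```

The same logic is what a vector database performs at scale, with approximate nearest-neighbor indexes replacing the brute-force sort.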

Method 2: Hierarchical In-Context Memory (MemGPT / Letta)

MemGPT, reborn as Letta, introduced an operating-system-inspired memory hierarchy. The agent maintains distinct memory tiers, each with different latency and capacity characteristics:

Main Context (hot):
  ├── Message Buffer    → recent conversation turns
  ├── Core Memory       → pinned facts: user profile, persona, active goals
  └── Working Notes     → agent-editable scratchpad

External (cold):
  ├── Recall Memory     → full interaction history (searchable on disk)
  └── Archival Memory   → large document stores (retrieval via embeddings)
  

The key innovation is that core memory blocks are editable by the agent itself: the agent can write and overwrite them mid-conversation, effectively curating its own context. This turns memory management from a static indexing problem into a dynamic, agent-driven process.

Rather than passively retrieving from an external store, the agent actively decides what to remember and what to discard, much like human working memory shunting information between short-term and long-term stores.
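A minimal sketch of agent-editable core memory, loosely modeled on the MemGPT/Letta idea. The block names and the append/replace operations are illustrative, not the actual Letta API:

```python
class CoreMemory:
    """Pinned memory blocks the agent can rewrite mid-conversation."""

    def __init__(self):
        self.blocks = {"persona": "", "human": "", "goals": ""}

    def memory_append(self, block, text):
        """Add a new fact to a block."""
        self.blocks[block] = (self.blocks[block] + "\n" + text).strip()

    def memory_replace(self, block, old, new):
        """Correct an outdated fact in place."""
        self.blocks[block] = self.blocks[block].replace(old, new)

    def render(self):
        """Serialize all blocks for injection into the system prompt."""
        return "\n".join(f"<{b}>\n{v}\n</{b}>" for b, v in self.blocks.items())

core = CoreMemory()
core.memory_append("human", "Name: Priya. Prefers concise answers.")
core.memory_replace("human", "concise", "detailed")  # agent self-edit mid-session
print(core.render())
```

The render output is re-injected every turn, so an edit made on turn 10 shapes everything the model sees from turn 11 onward.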

Method 3: Rolling Summarization

A lighter-weight alternative to full retrieval pipelines: the model periodically condenses the growing conversation transcript into a rolling summary. Instead of storing raw messages, the agent maintains a compressed representation of what happened and surfaces it as a prefix to each new prompt.

# Summarization-based session handoff
summary_t = LLM.summarize(conversation_so_far)
new_context = [system_prompt, summary_t] + recent_messages[-N:]
  

Summarization is computationally cheap and integrates cleanly with any LLM. The tradeoff is lossy compression: nuances, exact phrasing, and specific facts can be dropped. It works well when the agent needs continuity of intent more than verbatim recall of content.
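The handoff above can be sketched as a small, runnable loop. The summarizer here is a stub standing in for an LLM call, and the message-count trigger is an assumption; production systems typically trigger on token counts instead:

```python
def summarize_stub(messages):
    # Stand-in for an LLM call; a real system would prompt the model here
    return f"[summary of {len(messages)} earlier messages]"

def build_context(history, keep_last=3, max_messages=6, summarizer=summarize_stub):
    """Keep the last N messages verbatim; compress the rest into a rolling summary."""
    if len(history) <= max_messages:
        return list(history)
    older, recent = history[:-keep_last], history[-keep_last:]
    return [summarizer(older)] + recent

history = [f"msg-{i}" for i in range(10)]
print(build_context(history))  # ['[summary of 7 earlier messages]', 'msg-7', 'msg-8', 'msg-9']
```

Note the compression is one-way: once `msg-0` through `msg-6` are folded into the summary, their exact wording is gone, which is precisely the lossy tradeoff described above.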

Method 4: Temporal Knowledge Graphs (Zep / Graphiti)

The most structurally ambitious approach treats memory not as a flat store of text, but as a typed, time-stamped knowledge graph. Zep's Graphiti engine, introduced in January 2025, organizes agent memory into three hierarchical subgraphs:

Knowledge Graph G = (N, E, φ)

├── Episode Subgraph (Gₑ)
│     Raw input data: messages, text, JSON
│     Non-lossy; source of entity extraction

├── Semantic Entity Subgraph (Gₛ)
│     Entities extracted from episodes
│     Resolved and deduplicated across sessions

└── Community Subgraph (Gᶜ)
      Clusters of strongly connected entities
      High-level summaries of cross-entity relationships
  

On retrieval, vector search narrows candidate memories while graph traversal returns their relational context: who did what, with whom, and when. This dual-channel retrieval mirrors how human memory interleaves episodic recall (specific events) with semantic recall (conceptual associations).

On the LongMemEval benchmark, which stress-tests cross-session synthesis and temporal reasoning, Zep's Graphiti achieves up to an 18.5% accuracy improvement over baseline implementations while simultaneously reducing response latency by 90%.

Graph-based memory captures not just what occurred, but why, encoding causal and relational structure that vector similarity alone cannot represent.
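A toy illustration of the timestamped-edge idea, kept deliberately minimal. The relation names echo the enterprise schema later in this post; this is not the Graphiti API, just a sketch of why time-scoped traversal answers questions that similarity search cannot:

```python
from datetime import date

class TemporalKG:
    """Minimal timestamped edge store with time-windowed neighbor lookup."""

    def __init__(self):
        self.edges = []  # (subject, relation, object, valid_from)

    def add(self, subj, rel, obj, valid_from):
        self.edges.append((subj, rel, obj, valid_from))

    def neighbors(self, node, since=None):
        """All edges touching `node`, optionally restricted to a time window."""
        return [e for e in self.edges
                if node in (e[0], e[2]) and (since is None or e[3] >= since)]

kg = TemporalKG()
kg.add("Acme", "COMPLAINED_ABOUT", "billing", date(2024, 9, 3))
kg.add("billing", "ESCALATED_TO", "CSM-Lee", date(2024, 11, 12))
kg.add("Acme", "RENEWED", "Platform Pro", date(2023, 2, 1))

# "What happened with Acme this year?" filters out the 2023 renewal
print(kg.neighbors("Acme", since=date(2024, 1, 1)))
```

Chaining `neighbors` calls (Acme → billing → CSM-Lee) recovers the causal sequence of complaint and escalation, which no embedding distance would surface directly.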

Method 5: Agentic Memory (A-MEM and Zettelkasten Architectures)

A-MEM (2025) draws inspiration from the Zettelkasten method: a note-taking philosophy where ideas are not stored in isolation but linked explicitly to related ideas. Each memory unit is created with typed outgoing links to other memories, forming an interconnected knowledge network that grows denser and more useful over time.

Unlike passive storage systems, A-MEM allows memory to evolve autonomously: as new interactions arrive, the agent can create new notes, update existing ones, and add new links, with no human curator required. This positions memory as an active, self-organizing substrate for reasoning rather than a passive archive.
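The linking behavior can be sketched in a few lines. Keyword overlap is a toy stand-in for A-MEM's LLM-driven link generation, but it shows the structural idea: every new note is wired into the existing network at write time, and links are bidirectional:

```python
class NoteStore:
    """Zettelkasten-style memory: each new note links to related existing notes."""

    def __init__(self):
        self.notes = {}  # note_id -> {"text": str, "links": set of note ids}

    def add(self, note_id, text):
        # Link to any existing note sharing at least one word (toy heuristic)
        words = set(text.lower().split())
        links = {nid for nid, n in self.notes.items()
                 if words & set(n["text"].lower().split())}
        self.notes[note_id] = {"text": text, "links": links}
        for nid in links:  # links are bidirectional
            self.notes[nid]["links"].add(note_id)

store = NoteStore()
store.add("n1", "customer asked about billing export")
store.add("n2", "billing module bug in export path")
print(store.notes["n1"]["links"])  # {'n2'}
```

The network grows denser as notes accumulate, so later retrieval can follow links outward from any hit instead of relying on a single similarity score.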

Method 6: Session Memory Consolidation (OpenAI Agents SDK Pattern)

A practical pattern emerging in production systems separates memory into two explicitly staged tiers:

from dataclasses import dataclass
from typing import Dict

@dataclass
class AgentState:
    profile: Dict            # static user facts (CRM data)
    global_memory: Notes     # long-term, cross-session preferences
    session_memory: Notes    # in-session observations (staging)

# At session end: promote only high-signal session notes to global
def consolidate(state):
    for note in state.session_memory.notes:
        if signal_strength(note) > threshold:
            state.global_memory.notes.append(note)
    state.session_memory.notes.clear()
  

Session memory accumulates observations during a single interaction. At session close, a consolidation function powered by an LLM evaluates each note and promotes only the ones that carry durable signal into global memory. This prevents noise accumulation and keeps the long-term store clean.

Method 7: Reinforcement Learning on Episodic Memory (MemRL)

The most recent frontier: rather than hand-engineering memory management rules, train the agent to learn them. MemRL (January 2026) frames memory management as a reinforcement learning problem: the agent receives reward signals for correctly leveraging past experience and is penalized for ignoring relevant history or hallucinating.

Over time, the agent develops a policy for what to write to memory, when to retrieve, and how to use retrieved context in its reasoning chain. This makes memory not just a passive data store but an active, learned capability: one that improves with deployment rather than degrading.

MemRL and similar approaches signal a shift from memory as infrastructure to memory as a learned cognitive skill, one the agent gets better at over its operational lifetime.
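To make the "learned policy" idea concrete, here is a deliberately tiny bandit over a single decision (retrieve vs. skip, per query category). This is a gesture at the framing, not MemRL's actual algorithm; the category name and reward scheme are invented for illustration:

```python
import random

class RetrievalPolicy:
    """Epsilon-greedy value estimates over the retrieve/skip decision."""

    def __init__(self, lr=0.2, eps=0.1):
        self.q = {}  # (category, action) -> estimated value
        self.lr, self.eps = lr, eps

    def decide(self, category):
        if random.random() < self.eps:  # occasional exploration
            return random.choice(["retrieve", "skip"])
        return max(["retrieve", "skip"],
                   key=lambda a: self.q.get((category, a), 0.0))

    def update(self, category, action, reward):
        key = (category, action)
        old = self.q.get(key, 0.0)
        self.q[key] = old + self.lr * (reward - old)  # incremental average

random.seed(0)
policy = RetrievalPolicy()
# Simulated feedback: retrieval helps on "account_history" queries
for _ in range(50):
    action = policy.decide("account_history")
    policy.update("account_history", action, 1.0 if action == "retrieve" else 0.0)

print(policy.q[("account_history", "retrieve")])
```

After training, the policy has learned from reward alone that retrieval pays off for this category, the kind of decision that would otherwise be a hand-written heuristic.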

The Four Dimensions of Memory Design

Any memory system for an AI agent must make explicit design choices along four axes:

1. TEMPORAL SCOPE
   Short-term (within session) ←→ Long-term (cross-session)

2. STRUCTURE
   Flat text buffers ←→ Typed knowledge graphs

3. AGENCY
   Passive retrieval ←→ Agent-managed read/write

4. EVOLUTION
   Static archives ←→ Self-updating, RL-trained memory
  

The methods surveyed above occupy different positions in this design space. No single approach dominates all four axes; production systems increasingly combine them in hybrid architectures.

Comparison Summary

Method                          | Structure                       | Strengths                                  | Limitations
--------------------------------|---------------------------------|--------------------------------------------|---------------------------------------------
Vector Store (RAG)              | Flat embeddings                 | Fast, model-agnostic, scalable             | Surface-level recall; no relational context
Hierarchical (Letta/MemGPT)     | Tiered in-context + archival    | Agent-editable; fine-grained control       | Engineering complexity; per-agent state management
Rolling Summarization           | Compressed text                 | Lightweight; easy integration              | Lossy; loses verbatim facts and nuance
Temporal Knowledge Graph (Zep)  | Typed graph with timestamps     | Relational + temporal reasoning; accuracy  | Higher setup cost; graph maintenance overhead
Agentic (A-MEM / Zettelkasten)  | Linked note network             | Self-organizing; grows richer over time    | Link quality depends on extraction quality
Session Consolidation           | Staged tiers (session → global) | Practical; controls noise accumulation     | Requires good consolidation heuristics
RL on Episodic Memory (MemRL)   | Policy-learned memory ops       | Self-improving; adapts to task domain      | Requires training infrastructure and reward design

Applying This in Practice: Customer Context in Large Organizations

The memory methods surveyed above are not merely academic. One of their most commercially significant applications is building customer context inside large organizations: giving AI agents (and the humans they assist) a complete, living picture of every customer across teams, channels, and time.

In a large enterprise, customer data is never in one place. It lives in the CRM, in support tickets, in email threads, in call transcripts, in product usage logs, in billing systems, and in Slack messages. No single team sees the whole picture. The result is that every interaction with a customer starts with partial information, and agents, human or AI, are left reconstructing context that should already exist.

Enterprise customer context is fundamentally a memory problem. The organization has all the data. What it lacks is a unified, retrievable, relational representation of that data, structured for reasoning, not just storage.

The Three-Layer Architecture

Building robust customer context in a large org requires three distinct memory layers, each serving a different purpose and operating at a different time horizon.

Layer 1: Structured Profile (Core / Hot Memory)

This is the always-present, always-injected foundation. It is pulled from deterministic sources (CRM, ERP, data warehouse) and contains facts that are stable enough to pin into every agent session without retrieval overhead.

CustomerProfile:
  account_id:       "ACCT-00492"
  name:             "Meridian Logistics Group"
  tier:             "Enterprise"
  industry:         "Supply Chain"
  arr:              $840,000
  products:         ["Platform Pro", "Analytics Add-on"]
  csm:              "Priya Nair"
  renewal_date:     "2025-11-01"
  health_score:     72  # amber
  region:           "EMEA"
  primary_contact:  "James Okafor (VP Operations)"
  

This block is injected verbatim into the system prompt on every interaction. It is never retrieved; it is always there. Think of it as the agent's permanent briefing card for this account. When the health score changes, a webhook updates the profile store and the next session picks up the new value automatically.
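The injection step itself is simple string rendering. A minimal sketch, assuming the profile arrives as a dict with the field names shown above; the header text and layout are illustrative:

```python
def render_briefing_card(profile: dict) -> str:
    """Format the structured profile as a pinned system-prompt block."""
    lines = [f"{k}: {v}" for k, v in profile.items()]
    return "CUSTOMER PROFILE (always in context)\n" + "\n".join(lines)

profile = {
    "account_id": "ACCT-00492",
    "tier": "Enterprise",
    "arr": "$840,000",
    "health_score": 72,
}

# Prepended to the system prompt on every session, no retrieval step involved
system_prompt = render_briefing_card(profile) + "\n\nYou are the account assistant."
print(system_prompt.splitlines()[0])  # CUSTOMER PROFILE (always in context)
```

Because the card is rebuilt from the profile store at session start, a CDC update to any field is reflected in the very next conversation with no cache invalidation logic.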

Layer 2: Episodic History (Vector Store + Knowledge Graph)

This is where every interaction the organization has ever had with the customer lives: support tickets, sales call notes, QBR summaries, NPS responses, escalation records, onboarding milestones, email threads, and product usage anomalies. This corpus is too large to inject wholesale; it must be retrieved selectively.

For straightforward lookups ("find past issues with the billing module"), vector search over embedded interaction records works well. But for enterprise accounts, relational and temporal context is often more important than semantic similarity. Knowing that a billing complaint in Q3 preceded an escalation in Q4, which was handled by a CSM who has since left the company, requires a knowledge graph, not cosine similarity.

# Dual-channel retrieval for enterprise customer context
def retrieve_customer_context(account_id, query, top_k=8):

    # Channel 1: semantic similarity (vector search)
    semantic_hits = vector_db.search(
        embed(query),
        filter={"account_id": account_id},
        k=top_k
    )

    # Channel 2: relational traversal (knowledge graph)
    graph_context = kg.traverse(
        start_node=account_id,
        relation_types=["ESCALATED_TO", "COMPLAINED_ABOUT",
                        "RENEWED", "CHURNED_RISK", "CONTACT_CHANGE"],
        max_hops=2,
        time_window_days=180
    )

    return merge_and_rank(semantic_hits, graph_context)
  

The graph schema for an enterprise account encodes not just events, but the stakeholders involved, their roles, and how those roles have changed. A contact who was a champion two years ago but left the company is still relevant: their departure is itself a signal worth knowing.

# Enterprise account knowledge graph schema
Nodes:
  Account       → id, name, tier, health_score
  Contact       → id, name, role, status (active/churned/departed)
  Interaction   → id, type, date, sentiment, channel
  Issue         → id, category, severity, resolution_status
  Product       → id, name, version, adoption_score

Edges:
  (Contact)      -[WORKS_AT]->       (Account)
  (Contact)      -[PARTICIPATED_IN]-> (Interaction)
  (Interaction)  -[RAISED]->          (Issue)
  (Issue)        -[ESCALATED_TO]->    (Contact)
  (Account)      -[USES]->            (Product)
  (Account)      -[AT_RISK_OF]->      (Churn | Expansion)
  

Layer 3: Session Memory with Consolidation (Staging)

During a live interaction (a support call, a sales conversation, a QBR), the agent accumulates new observations that have not yet been written to the long-term store. These sit in a session-scoped staging layer. At the end of the interaction, a consolidation pass evaluates each observation and promotes only the ones that carry durable signal.

# Session consolidation for enterprise customer interactions
class CustomerSessionMemory:

    def __init__(self, account_id):
        self.account_id = account_id
        self.staged_observations = []

    def observe(self, note: str, category: str):
        self.staged_observations.append({
            "note": note,
            "category": category,   # e.g. "risk", "sentiment", "stakeholder"
            "timestamp": now()
        })

    def consolidate(self, llm):
        for obs in self.staged_observations:
            signal = llm.score_signal(obs)   # 0.0 to 1.0
            if signal > 0.65:
                long_term_store.write(
                    account_id=self.account_id,
                    observation=obs,
                    ttl_days=365
                )
        self.staged_observations.clear()
  

The consolidation threshold (0.65 above) is a tunable parameter. For a churn-risk signal ("customer mentioned they are evaluating a competitor"), the LLM should score this close to 1.0 and always promote it. For a passing comment ("the call audio was a bit choppy"), it scores low and gets discarded.
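One natural refinement of the single 0.65 threshold is to make the bar category-dependent, so high-stakes signals are promoted even at moderate scores. The category names and threshold values below are hypothetical, purely to illustrate the tuning surface:

```python
# Hypothetical per-category thresholds: riskier signals get a lower bar
PROMOTION_THRESHOLDS = {
    "risk":        0.30,   # almost always promote churn/competitor signals
    "stakeholder": 0.50,
    "sentiment":   0.65,
    "smalltalk":   0.95,   # almost never promote
}

def should_promote(observation: dict, score: float) -> bool:
    """Promote a staged observation if its LLM score clears its category bar."""
    threshold = PROMOTION_THRESHOLDS.get(observation["category"], 0.65)
    return score > threshold

risk_obs = {"note": "customer mentioned evaluating a competitor", "category": "risk"}
chat_obs = {"note": "the call audio was a bit choppy", "category": "smalltalk"}

print(should_promote(risk_obs, score=0.4))   # True: low bar for risk signals
print(should_promote(chat_obs, score=0.7))   # False: high bar for small talk
```

This keeps the LLM scorer generic while letting the promotion policy encode business judgment about which categories matter.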

The Four Enterprise-Specific Complications

The three-layer architecture above works cleanly in theory. In a real large organization, four additional complications must be addressed explicitly.

1. Data Silos and Ingestion Pipelines

Customer data is scattered across systems that were never designed to talk to each other. A unified ingestion pipeline must pull from all of them, normalize the output, embed it, and write it into the shared memory store. The practical stack looks like:

Data Sources                Ingestion Layer             Memory Store
────────────────            ───────────────             ────────────
Salesforce (CRM)     →
Zendesk (tickets)    →
Gong (call records)  →      ETL / CDC pipeline     →    Vector DB (Qdrant)
Email (IMAP/Gmail)   →      (Fivetran, Airbyte,    →    Knowledge Graph (Neo4j)
Product telemetry    →       or custom Kafka)      →    Profile Store (Postgres)
Billing (Stripe)     →
  

Change-data-capture (CDC) is preferred over batch ETL for high-value accounts: you want the memory store to reflect CRM updates within minutes, not overnight. A renewal date change or a health score drop should propagate to the profile layer immediately.
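The propagation step can be sketched as a small change-event handler. The event shape, store layout, and function names here are assumptions for illustration; a real pipeline would sit behind a CDC tool or webhook endpoint:

```python
# In-memory stand-in for the profile store (Postgres in the diagram above)
PROFILE_STORE = {
    "ACCT-00492": {"health_score": 72, "renewal_date": "2025-11-01"},
}

def on_crm_change(event: dict) -> bool:
    """Apply a CDC-style change event {account_id, field, new_value}.

    Returns True if the profile was updated, False if the event was ignored.
    """
    profile = PROFILE_STORE.get(event["account_id"])
    if profile is None or event["field"] not in profile:
        return False  # unknown account or non-profile field: ignore
    profile[event["field"]] = event["new_value"]
    return True

on_crm_change({"account_id": "ACCT-00492", "field": "health_score", "new_value": 58})
print(PROFILE_STORE["ACCT-00492"]["health_score"])  # 58
```

Because the profile layer is read fresh at session start, the very next agent conversation sees the dropped health score with no batch window in between.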

2. Access Control and Data Permissions

Not all customer context should be visible to all agents or all users. A support agent handling a tier-1 ticket should not see the account's commercial negotiation history. A sales rep closing an upsell should not see the raw content of a sensitive support escalation. Access control must be enforced at the memory layer, not just the application layer.

# Row-level access control on memory retrieval
def retrieve_with_acl(account_id, query, caller_role):
    allowed_categories = ACL_MATRIX[caller_role]
    # e.g. "support" role → ["technical", "product", "sentiment"]
    #      "sales" role   → ["commercial", "expansion", "renewal"]
    #      "executive" role → ["*"]  (all categories)

    return vector_db.search(
        embed(query),
        filter={
            "account_id": account_id,
            "category": {"$in": allowed_categories}
        }
    )
  

For regulated industries (financial services, healthcare), this access control layer may need to satisfy compliance requirements: GDPR data residency, HIPAA audit trails, SOC 2 logging. Memory stores must be designed with these constraints in mind from the start, not retrofitted later.

3. Memory Freshness and TTL Policies

Stale context is often worse than no context. A support interaction from three years ago may describe a product version that no longer exists. A contact listed as the economic buyer may have left the company. A contract value stored in the profile may predate a significant upsell. Every record in the memory store needs an explicit freshness policy.

TTL Policies by Memory Category:
  ├── Profile (CRM fields)        → No TTL; refresh via CDC webhook
  ├── Support tickets             → Retain 24 months; archive after
  ├── Sales call transcripts      → Retain 36 months
  ├── Sentiment observations      → Retain 12 months; decay weight after 6
  ├── Stakeholder graph edges     → No TTL; mark departed contacts as inactive
  └── Product usage telemetry     → Rolling 90-day window; summarize older data
  

The decay approach for sentiment is particularly important: a negative sentiment signal from 18 months ago should carry less weight than one from last month, but it should not be deleted; the history of the relationship matters even when individual signals age out.
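One simple way to implement the decay is an exponential half-life weight, applied at retrieval time so the underlying records stay intact. The 180-day half-life is an illustrative choice, not a recommendation:

```python
def decayed_weight(age_days: float, half_life_days: float = 180.0) -> float:
    """Exponential decay: a signal loses half its weight every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

recent = decayed_weight(30)    # last month: weight ~0.89
old = decayed_weight(540)      # 18 months ago: weight 0.125
print(round(recent, 3), old)
```

Applied as a multiplier on retrieval scores, this down-weights an 18-month-old complaint without erasing it, which matches the "decay weight after 6 months" policy in the TTL table above.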

4. Multi-Contact Account Modeling

Enterprise accounts are not single entities. They have champions, economic buyers, IT gatekeepers, end users, and executive sponsors, each with their own history, preferences, and relationship to the product. A memory system that flattens all of this into a single "customer" record loses critical signal.

# Correct: contact-aware retrieval (retrieve_customer_context from Layer 2,
# extended here with a contact_id filter)
context = retrieve_customer_context(
    account_id="ACCT-00492",
    contact_id="CONTACT-James-Okafor",  # the specific person on this call
    query="renewal objections and pricing concerns"
)

# context now includes:
#   - James Okafor's personal interaction history
#   - Issues he personally raised or escalated
#   - His sentiment trend over the last 4 quarters
#   - Account-level context (shared across all contacts)
  

This contact-level granularity matters because the champion and the economic buyer often have different (sometimes opposing) concerns. An agent that treats them as interchangeable will give advice that is at best generic and at worst counterproductive.

What to Build First: A Phased Rollout

The full architecture described above is not a single-quarter project. A phased approach is strongly recommended, with each phase delivering standalone value while laying the foundation for the next.

Phase                          | What to Build                                                            | Memory Methods Used
-------------------------------|--------------------------------------------------------------------------|--------------------------------------
1. Profile Layer               | Sync CRM fields into a structured profile injected at session start      | Core memory (hot injection)
2. Interaction History         | Embed and index support tickets, call transcripts, emails in a vector store | Vector store (RAG)
3. Session Consolidation       | Capture and promote durable signals from live interactions               | Session memory consolidation
4. Knowledge Graph             | Build relational + temporal graph over contacts, issues, events          | Temporal knowledge graph (Zep/Neo4j)
5. Access Control + Compliance | Layer ACL policies; implement TTL and audit logging                      | All layers
6. Adaptive Memory             | RL-trained consolidation and retrieval policies for high-value accounts  | MemRL / agentic memory

Start with what already exists. Your CRM is already a memory store; it is just not yet shaped for agent retrieval. Phase 1 is connecting what you have, not building something new.

The Payoff: What Good Customer Context Enables

When all three layers are working (structured profile, episodic history with relational graph, session consolidation), the capabilities unlocked go well beyond faster support responses. The organization gains, effectively, an institutional memory about each customer relationship: one that does not degrade when CSMs change, does not reset when a ticket is closed, and does not stay siloed within a single team.

Concretely, agents equipped with full customer context can proactively surface renewal risks before they become churn events, personalize QBR narratives with account-specific history rather than generic templates, route escalations to the right person based on stakeholder relationship graphs, and flag when a new contact on an account matches the profile of a previous champion who drove expansion.

These are not incremental improvements to existing workflows. They represent a qualitative shift in what AI agents can do inside an organization from answering questions about customers to reasoning about them.

The organization that invests in customer memory infrastructure is not just building better AI tools. It is building an institutional nervous system: one that learns from every interaction and makes that learning available to every agent that comes after.

Closing Thoughts

Memory was long treated as a footnote in AI systems, a patch applied after the fact to make stateless models feel continuous. That framing is giving way to something more intentional. Memory is now understood as the core mechanism that separates a capable model from a capable agent: one that accumulates experience, maintains coherent identity across sessions, and improves through use.

The seven methods surveyed in this post span a wide design space, from simple vector retrieval to RL-trained memory policies. None of them is universally correct. The right architecture depends on the time horizon of memory needed, the relational complexity of the domain, and the tolerance for engineering overhead. What they share is a common recognition: context is not a convenience, it is the substrate of intelligent behavior.

The enterprise customer context section makes this concrete. When memory is designed well, an organization stops losing institutional knowledge every time a CSM changes accounts or a support ticket closes. It stops treating each customer interaction as if it were the first. It starts reasoning about relationships over time rather than just responding to events in isolation. That shift, from reactive to relational, is what well-architected memory actually delivers in practice.

The field is still early. RL-trained memory management, temporal knowledge graphs, and agentic self-organization are all active research areas with significant open problems around consistency, forgetting, and trust. But the direction is clear: memory is becoming a first-class concern in AI system design, not an afterthought.

The goal is not for agents to remember everything. It is for them to remember the right things, in the right structure, available at the right moment. That is the engineering challenge that memory research is converging on.
