The Cognitive Architecture of LLMs - A Synthesis of Memory, Reasoning, and Agency


image-4

Introduction

Large Language Models (LLMs) have become a foundational infrastructure in the pursuit of Artificial General Intelligence (AGI), demonstrating remarkable capabilities in language perception, generation, and in-context learning.1 Models from prominent families such as GPT, LLaMA, and PaLM, built upon the Transformer architecture, have achieved emergent abilities that allow them to follow complex instructions and learn new tasks from a handful of examples presented at inference time.2 However, despite these advances, the first generations of LLMs are fundamentally stateless systems. They operate as highly sophisticated pattern matchers, processing inputs and generating outputs in a reactive loop without a persistent, structured mechanism for retaining and managing information over time. This architectural shortcoming represents a primary barrier to achieving more general and autonomous intelligence, leading to critical issues in real-world applications: an inability to model long-term conversational states, poor adaptability to evolving knowledge, a lack of persistent modeling for user preferences, and the creation of “memory silos” where knowledge from one interaction is lost to the next.1

The core problem is that without a unified and structured architecture for handling memory, LLMs cannot “accumulate experience,” “retain memory,” and “continuously evolve” in the way biological organisms do.1 While nascent methods like Retrieval-Augmented Generation (RAG) have begun to graft on explicit, plaintext memory, they often lack sophisticated lifecycle management and are treated as add-ons rather than as integral parts of the cognitive architecture.1 This limitation prevents the development of robust long-context reasoning, continual personalization, and knowledge consistency across tasks and platforms.3 As AI systems are increasingly deployed in complex, dynamic environments characterized by multi-tasking, multi-agent collaboration, and multi-modal information streams, the need for models to retain, update, and leverage prior knowledge becomes paramount.4

A paradigm shift is underway, moving from treating memory as an auxiliary feature to architecting it as a core, first-class operational resource within the LLM ecosystem. The central thesis of this report is that the next fundamental leap in AI will arise from the ability to continuously model, manage, and schedule memory across its various forms.1 This involves establishing a complete memory lifecycle encompassing generation, organization, utilization, and evolution.3 This shift represents a move from systems that merely perceive and generate to those that can remember, adapt, and grow over time. Advanced frameworks are now being proposed that envision a “memory operating system” for LLMs, designed to manage memory as a controllable, adaptable, and evolvable infrastructure that unifies disparate memory types and enables sophisticated cognitive functions.1

This report provides a comprehensive synthesis of the state of the art in LLM memory mechanisms, drawing upon an extensive body of recent academic and industry research. It begins by establishing a foundational taxonomy of memory, deconstructing the concept into its technical and functional components. It then proceeds with a deep analysis of the three primary memory substrates: explicit memory, with a focus on the dominant RAG paradigm and its robustness; parametric memory, exploring the dynamics of knowledge editing and unlearning for compliance; and working memory, examining the optimization of the KV-cache for serving efficiency and the role of reflective reasoning traces. Finally, the report integrates these components into the architecture of autonomous LLM-based agents, illustrating how memory underpins higher-order cognitive capabilities such as planning, tool use, and learning from feedback loops. This structured analysis aims to build a comprehensive understanding of LLM cognitive architecture, from first principles to the frontiers of autonomous agency.

I. A Taxonomy of Memory in Large Language Models

To comprehend the role of memory in modern AI, it is essential to first establish a clear and structured taxonomy. The concept of “memory” in the context of LLMs is not monolithic; it is a multi-layered system with different substrates serving distinct functions, each with its own profile of persistence, cost, and latency. Recent comprehensive surveys have converged on several classification schemes, which can be organized into a technical, implementation-focused model and a higher-level, functional model inspired by cognitive psychology.1 The most advanced research seeks to unify these views under a single operational framework.1

1.1 The Foundational Layers: A Tripartite Model

From a system architecture perspective, LLM memory can be categorized into three primary types based on where and how information is stored: parametric, explicit, and working memory.1 This technical distinction is fundamental to understanding the engineering trade-offs involved in designing memory-enabled systems.

1.1.1 Parametric Memory (Implicit Memory)

Parametric memory refers to the vast body of knowledge that is implicitly encoded directly into the billions of parameters, or weights, of the neural network during its pre-training phase.1 This constitutes the model’s static “world knowledge,” learned from massive, web-scale text corpora.2 The process of knowledge memorization is governed by scaling laws, which empirically describe the relationship between model size, the volume of training data, and the model’s performance on downstream tasks, including its ability to recall factual information.2 While larger models trained on more data can memorize more facts, research indicates that it is practically impossible for any model to memorize all public factual knowledge, such as the entirety of Wikidata, within a general pre-training setting.6

The defining characteristic of parametric memory is its access-cost asymmetry. Reading this knowledge is extremely fast and computationally cheap, as it is an intrinsic part of the model’s forward pass during inference. However, writing to or updating this memory is a high-latency, high-cost operation. Modifying parametric knowledge requires computationally expensive processes like fine-tuning or complete retraining of the model.9 Consequently, this knowledge is inherently static and becomes outdated as the world changes, a significant limitation for applications requiring current information.10

1.1.2 Explicit Memory (Non-Parametric / External Memory)

Explicit memory, also known as non-parametric memory, involves the use of external, dynamic, and queryable knowledge stores that augment the LLM’s capabilities at inference time.6 This approach decouples the model’s learned knowledge from its parameters, allowing for a more flexible and updatable memory system. The canonical and most widespread implementation of explicit memory is Retrieval-Augmented Generation (RAG).5 In a RAG system, when a query is received, the system first retrieves relevant information from an external source—such as a textual corpus, a vector database, or a structured knowledge graph—and then provides this information to the LLM as additional context to inform its generated response.10

The primary advantage of explicit memory is the ease and low cost of knowledge updates. To provide the model with new or corrected information, one simply needs to update the external database, a process that is orders of magnitude cheaper and faster than retraining the LLM.13 This mechanism significantly enhances the factual accuracy of model outputs, mitigates the problem of “hallucination” (generating plausible but false information), and increases user trust by providing traceability and the ability to cite sources.5 The main trade-off is the introduction of retrieval latency into the inference pipeline and the added complexity of managing the retrieval system itself.

1.1.3 Working Memory (Activation-based / Short-Term Memory)

Working memory is the transient, volatile information that a model holds within its internal state, or activations, during a single inference pass or a continuous conversational session.1 This form of memory is crucial for maintaining immediate context and coherence. Its most basic form is the information contained within the LLM’s finite context window, which is the sequence of tokens the model can “see” at any given moment.3

A critical mechanism for managing working memory efficiently is the Key-Value (KV) cache.15 In the Transformer architecture’s autoregressive generation process, generating each new token requires attending to the key and value vectors of all preceding tokens. The KV-cache stores these vectors in GPU memory, avoiding redundant re-computation at each step and thus accelerating inference.16 Beyond the KV-cache, working memory also encompasses the intermediate reasoning steps that a model generates to solve complex problems, such as the explicit “thoughts” produced in Chain-of-Thought (CoT) prompting or more advanced reflective reasoning traces.18 The primary characteristic of working memory is its volatility; it is session-specific and is typically lost once an interaction concludes. Its management, particularly the size of the context window and the KV-cache, presents a major computational challenge for achieving efficient inference, especially in long-context scenarios.3

Memory Substrate Primary Form Persistence Update Cost Read Latency Key Use Case Core Limitation
Parametric (Implicit) Model Weights Long-Term, Static Very High (Retraining) Extremely Low General World Knowledge, In-context Learning Static, Outdated Knowledge, Opaque
Explicit (RAG) External Database (Vector, Text, Graph) Long-Term, Dynamic Low (DB Update) Moderate (Retrieval) Factual Grounding, Timeliness, Domain-Specific QA Retrieval Latency, Robustness to Noise
Working (Activation) Context Window, KV-Cache, Reasoning Traces Transient, Volatile N/A (Session-based) Low (In-context) Conversational Context, Multi-step Reasoning Finite Size, High Computational/Memory Cost

Table 1.1: Comparative Analysis of LLM Memory Substrates. This table synthesizes the core characteristics and trade-offs of the three foundational memory types in LLMs, based on data from multiple comprehensive surveys.1

1.2 The Agentic Memory Framework: A Cognitive Psychology Analogy

As research shifts from standalone LLMs to autonomous LLM-based agents, a parallel taxonomy has emerged that is more functional and draws analogies from human cognitive psychology.7 This framework is not a replacement for the technical model but rather a higher level of abstraction that describes the

purpose of memory within a goal-directed system. It helps to frame how agents can achieve self-evolution and perform complex tasks by accumulating and reflecting on experience.20

  • Episodic Memory: This is the agent’s record of past personal experiences, storing a sequence of events, interactions, and observations.21 For example, it might contain the history of a multi-turn conversation or a log of actions taken and their outcomes. This memory is fundamental for learning from feedback, self-reflection, and avoiding repeated mistakes.20 It is typically implemented using a combination of transient working memory for the current episode and long-term storage in an explicit database for cross-trial information.7

  • Semantic Memory: This corresponds to the agent’s repository of general, factual knowledge about the world—concepts, facts, and properties. It is the agent’s “textbook knowledge.” This is realized through a combination of the LLM’s intrinsic parametric memory and the knowledge it can access on-demand from external sources via explicit memory mechanisms like RAG.

  • Procedural Memory: This is the agent’s “how-to” knowledge, encompassing its ability to perform tasks and execute skills. This includes knowing how to use external tools, follow a plan, or interact with an API. This type of memory is often encoded through a combination of detailed instructions in the agent’s prompt, fine-tuning on specific tasks, or learned implicitly through reinforcement learning from feedback loops.

The development of these distinct taxonomies is not a sign of a fragmented field, but rather of its maturation. The engineering-focused classification (Parametric, Explicit, Working) provides a bottom-up, technically grounded view based on where information is stored. In contrast, the agentic, cognitive-science-focused classification (Semantic, Episodic, Procedural) offers a top-down, functional perspective based on what the memory is for. These two views are converging. For example, an agent’s “episodic memory” is a cognitive function that is implemented using a combination of “working memory” (for the current event) and “explicit memory” (for storing the history of events). This convergence demonstrates a deliberate effort to construct AI systems with architectures that mirror human cognitive functions, where specific engineering solutions are chosen to fulfill a desired cognitive role.

1.3 The “Memory-as-an-OS” Abstraction: Towards Unified Management

The most forward-looking conceptualization of LLM memory transcends the view of individual, hard-coded pipelines and instead treats memory as a unified, manageable system resource. This “Memory-as-an-OS” paradigm, most clearly articulated in the MemOS proposal, elevates memory to a first-class operational component of the AI system.1 The central idea is to create a memory operating system that provides standardized mechanisms for the representation, organization, scheduling, and governance of information across all memory types.1 This approach aims to address the architectural absence of explicit and hierarchical memory representations, which has been identified as a root cause of sub-optimal performance in terms of cost and efficiency.3

At the core of this vision is the MemCube, a proposed standardized memory abstraction designed to unify heterogeneous memory formats.1 The MemCube would act as a container for information, regardless of whether it originates from model parameters, activations, or an external plaintext file. This standardization is the key to enabling advanced memory management functions, such as cross-type tracking, fusion of information from different sources, and, most critically,

memory migration. MemOS envisions pathways for transforming information between memory types—for instance, consolidating transient working memory (activations) into persistent explicit memory (plaintext), or internalizing frequently retrieved explicit knowledge into the model’s weights (parametric memory) through automated fine-tuning.1

This concept of memory migration is not merely an architectural elegance; it is driven by a fundamental economic imperative. Each memory substrate has a distinct profile of read cost, write cost, and storage cost.

  1. Working Memory (Activations): Information in the context window is very fast to “read” (it is directly accessible to the attention mechanism) but has a very high computational cost per inference step, especially as context length grows. It is also extremely expensive to “write” in the sense that it must be re-processed with every new query.

  2. Explicit Memory (RAG): This has a moderate read cost, which includes the latency of the retrieval step. However, its write cost is very low—updating a vector database is a trivial operation compared to model training. Storage costs are also relatively low.

  3. Parametric Memory (Weights): This offers the lowest possible read cost, as the knowledge is baked into the model’s inference path. Conversely, it has the highest write cost, requiring resource-intensive fine-tuning or retraining.

A memory operating system, therefore, functions as an economic scheduler. It would be responsible for optimizing the placement of information across this memory hierarchy based on its access frequency, volatility, and importance. For example, a user’s name, mentioned once, might live in working memory for the duration of a conversation. If that user interacts daily, their preferences might be consolidated into an explicit memory store (a user profile). If a piece of public information is retrieved by thousands of users every day, the memory OS might flag it as a candidate for internalization into parametric memory to reduce cumulative retrieval costs. This represents a fundamental shift from human-defined RAG pipelines to learnable, model-defined memory strategies, empowering LLMs to take responsibility for shaping their own cognitive architectures over time.3 Such a system lays the foundation for true continual learning, deep personalization, and the cross-platform migration of memory, enabling a user’s preferences and history to persist coherently across web, mobile, and enterprise deployments.1

II. Retrieval-Augmented Generation (RAG): The Dominant Explicit Memory Paradigm

As the primary mechanism for implementing explicit memory, Retrieval-Augmented Generation (RAG) has moved from a promising research concept to a foundational technology for building practical, knowledge-intensive LLM applications.10 By synergistically merging the intrinsic, parameterized knowledge of an LLM with the vast, dynamic repositories of external databases, RAG directly addresses some of the most pressing limitations of standalone models: hallucination, outdated knowledge, and a lack of transparency.10 This section provides a deep analysis of the RAG paradigm, tracing its evolution from simple pipelines to complex modular frameworks, examining the critical frontiers of its robustness and evaluation, and exploring its application in high-stakes domains.

2.1 The Evolution of RAG: From Naive to Modular

The development of RAG can be understood as a progression through three distinct paradigms, each adding layers of sophistication to address the shortcomings of the previous one.5

2.1.1 Naive RAG

The initial and most straightforward implementation of RAG follows a simple “retrieve-then-read” process.5 This paradigm, often referred to as Naive RAG, consists of a linear pipeline with three core steps 10:

  1. Indexing: An external knowledge base (e.g., a collection of documents) is processed offline. The documents are split into smaller, manageable “chunks.” An embedding model then converts each chunk into a high-dimensional vector representation, which is stored in a vector database for efficient searching.

  2. Retrieval: When a user submits a query, the same embedding model converts the query into a vector. This query vector is then used to perform a similarity search (typically cosine similarity) against the indexed chunk vectors in the database. The top-K most similar chunks are retrieved as the relevant context.

  3. Generation: The retrieved chunks are concatenated with the original user query to form an augmented prompt. This comprehensive prompt is then fed to the LLM, which synthesizes a final answer based on both its internal knowledge and the provided external context.

While this basic approach can significantly improve answer accuracy and reduce hallucinations, it is fraught with challenges. The retrieval phase often struggles with low precision (retrieving misaligned or irrelevant chunks) and low recall (failing to retrieve all necessary information).10 The generation phase can still produce content that is not fully supported by the retrieved context or may suffer from irrelevance and bias.10

2.1.2 Advanced RAG

To overcome the limitations of the naive approach, the field developed Advanced RAG, which focuses on optimizing specific stages within the pipeline.5 This paradigm introduces more refined techniques for data processing and retrieval logic. The optimizations can be grouped by where they occur in the pipeline 25:

  • Pre-retrieval: These techniques aim to improve the quality of the query itself before it is sent to the retriever. This includes methods like query expansion, where the original query is enriched with additional keywords or synonyms, and query transformation, where the query is rephrased to better match the structure of the source documents. A notable technique is Step-Back prompting, where the LLM is first asked to generate a more general, abstract question from the user’s specific query. The retrieval is then performed on this abstract question, which can lead to the retrieval of more foundational and relevant context.13

  • Retrieval: This stage involves enhancing the core retrieval process. Instead of relying solely on dense vector search, systems may employ hybrid search, which combines semantic similarity from vector search with the precision of traditional keyword-based search (like BM25).26 Significant effort is also invested in optimizing the indexing strategy itself, for example, by experimenting with different

    chunking strategies (e.g., semantic chunking vs. fixed-size chunking) or by embedding smaller text segments while indexing larger parent documents to improve context.25

  • Post-retrieval: After an initial set of documents is retrieved, post-processing steps are applied to refine the context before it reaches the generator. Re-ranking is a common technique, where a more sophisticated (and often slower) model, like a cross-encoder, re-evaluates the relevance of the top-K retrieved documents to select the best ones. Other methods focus on context compression to fit more information into the LLM’s limited context window or filtering out irrelevant and redundant passages to prevent the generator from being distracted by noise.5

2.1.3 Modular RAG

The most recent and flexible paradigm is Modular RAG, which reconceptualizes the RAG system not as a rigid, linear pipeline but as a dynamic framework of interconnected and often interchangeable modules.5 This architecture allows for far more complex and adaptive information-seeking behaviors. The Naive and Advanced RAG paradigms can be seen as specific, fixed instantiations of this more general modular approach.13

A Modular RAG system might include distinct modules for searching, memory management, document fusion, and routing logic that determines the flow of information.13 This modularity enables sophisticated workflows, such as:

  • Iterative Retrieval: Instead of a single retrieval step, the agent can perform multiple rounds of information seeking. It might retrieve an initial document, analyze it, and then generate a follow-up query to dig deeper or clarify an ambiguity.27

  • Adaptive Retrieval: The system can learn to decide when it is necessary to retrieve information. For knowledge-intensive questions, it would trigger the retrieval module, but for simple conversational queries, it might rely solely on its parametric memory, saving latency and cost.29

  • Complex Retrieval Sources: Modular RAG can seamlessly integrate various knowledge sources beyond simple text, such as querying structured knowledge graphs for precise relationships or SQL databases for numerical data.12

A prime example of this advanced capability is the FLARE (Forward-Looking Active Retrieval) method. FLARE iteratively generates a temporary prediction of the upcoming sentence in a long-form answer. If this predicted sentence contains low-confidence tokens, the system uses it as a query to actively retrieve relevant documents before finalizing the generation of that sentence, creating a proactive and tightly integrated reasoning-retrieval loop.27

2.2 Robustness and Evaluation: The Achilles’ Heel of RAG

Despite its architectural evolution, the practical deployment of RAG is often hindered by its fragility. The performance of RAG systems can degrade precipitously when they encounter the noisy and imperfect conditions of the real world, a critical issue that early evaluation methods failed to capture.30 This has led to a parallel evolution in how RAG systems are benchmarked and evaluated.

2.2.1 The Robustness Gap

Research has shown that RAG systems are surprisingly vulnerable to perturbations in both the query and the source documents.30 Even minor query variations, such as typos, rephrasing, or the inclusion of irrelevant information, can cause the retriever’s performance to drop significantly.31 More concerning is the lack of

document robustness: when source documents contain conflicting information, are outdated, or are subtly manipulated, the generator often fails to handle the ambiguity correctly, with document robustness consistently being the weakest point across different model sizes and architectures.30 This is particularly problematic in multi-hop reasoning scenarios, where the misuse of irrelevant or incorrect evidence in one step can lead to cascading errors.32

The evolution of RAG architecture is, in many ways, a direct response to this robustness challenge. The fragility of Naive RAG, when exposed to systematic stress tests, created the engineering imperative for the more resilient designs seen in Advanced and Modular RAG. For instance, the development of query transformation techniques is a direct attempt to mitigate the effects of query perturbations. The introduction of post-retrieval re-ranking and filtering is designed to handle noisy or irrelevant retrieved documents. The move towards iterative retrieval is essential for tackling the multi-hop reasoning tasks where single-shot retrieval fails. This dynamic interplay—where better evaluation reveals new weaknesses, which in turn drives architectural innovation—is a hallmark of a maturing research field. The need for faster iteration in this complex design space has even spurred the development of novel evaluation frameworks themselves.

2.2.2 The Evaluation Challenge and Modern Benchmarks

Evaluating a RAG system is more complex than evaluating a standard LLM. It requires a multi-faceted approach that assesses the individual components (retriever, generator) as well as the end-to-end system performance.5 A consensus has formed around a core “RAG Triad” of evaluation dimensions:

  1. Context Relevance/Precision: Does the retriever fetch the correct and most relevant documents for the given query?

  2. Answer Faithfulness/Grounding: Is the generated answer fully and accurately supported by the provided retrieved context? This measures the system’s ability to avoid hallucinating beyond the evidence.

  3. Answer Relevance/Correctness: Does the final answer correctly and completely address the user’s underlying intent?

To measure these dimensions, a new generation of sophisticated benchmarks has been developed:

  • RARE (Retrieval-Aware Robustness Evaluation): A framework specifically designed to stress-test RAG systems. It features a knowledge-graph-driven pipeline that automatically generates single and multi-hop questions from time-sensitive corpora and systematically applies perturbations to both queries and documents to quantify resilience.30

  • Comprehensive Benchmarks (RGB, CRUD-RAG, LFRQA): These benchmarks provide large-scale, diverse datasets, often built from real-world sources like online news articles, to test RAG systems in more realistic scenarios.27 LFRQA, for example, focuses on long-form question answering that requires integrating information from multiple documents, pushing beyond simple fact extraction.33

  • RAGAs (Retrieval Augmented Generation Assessment): This framework addresses a key bottleneck in RAG development: the need for human-annotated reference answers. RAGAs provides a suite of metrics for reference-free evaluation, using the LLM itself to assess faithfulness and relevance based on the query and retrieved context alone. This enables much faster and more scalable evaluation cycles.27

Beyond the core triad, these modern evaluations also measure additional capabilities crucial for real-world deployment, such as noise robustness, negative rejection (the ability to gracefully decline to answer when no relevant information is found), and information integration.27

2.3 RAG in High-Stakes Domains: Healthcare and Manufacturing

The ability of RAG to ground LLM outputs in verifiable, domain-specific knowledge makes it an indispensable technology for high-stakes industries where accuracy, timeliness, and evidence are non-negotiable.

2.3.1 Healthcare

In healthcare, RAG is a critical enabler for the responsible use of generative AI.14 The application of LLMs in this domain is hampered by their propensity for hallucination and their reliance on static, potentially outdated training data—risks that are unacceptable in a clinical context.37 RAG mitigates these risks by allowing models to draw from curated, authoritative, and up-to-date knowledge sources.37

Key use cases include:

  • Clinical Decision Support: Empowering clinicians to query the latest medical research from databases like PubMed, access up-to-the-minute treatment guidelines, and analyze internal electronic health records (EHRs) to inform diagnoses and treatment plans.35

  • Operational Automation: Automating administrative tasks like medical coding by retrieving information from clinical notes, or generating clinical summaries for patient handoffs.35

  • Personalized Patient Communication: Creating highly personalized patient education materials by combining general medical knowledge with specific, confidential patient records retrieved in a secure manner.35

Despite its promise, the application of RAG in healthcare faces challenges. A systematic review found that while various RAG techniques (Naive, Advanced, Modular) are being employed, there is a significant lack of standardized evaluation frameworks tailored to the medical domain.14 Furthermore, a majority of studies fail to adequately assess or address the critical ethical considerations of privacy, safety, and bias.14 To tackle the complexity of medical reasoning, which often requires multiple steps of inquiry, researchers are exploring advanced techniques like iterative RAG. The

i-MedRAG framework, for example, prompts the LLM to iteratively generate follow-up queries based on previous retrieval attempts, mimicking the information-seeking process of a human clinician and showing improved performance on complex medical exam questions.40

2.3.2 Manufacturing

In the manufacturing and industrial sectors, RAG is being deployed to address persistent challenges in knowledge transfer, operational efficiency, and workforce training.41 These environments are characterized by complex machinery, extensive technical documentation, and the need for real-time, context-aware information.

Key use cases include:

  • Real-time Assistance: Providing on-demand support to shop floor workers, allowing them to use natural language to query vast repositories of technical manuals, standard operating procedures (SOPs), and maintenance logs.41 This is particularly powerful for handling complex, proprietary documents like PDFs with embedded tables and schematics.43

  • Automated Documentation and Reporting: Automating the generation of intricate technical documents and producing dynamic reports on manufacturing performance, inventory levels, and quality control data by pulling information from multiple internal systems.41

  • Predictive Maintenance and Quality Assurance: Analyzing maintenance records and customer feedback to identify trends, predict equipment failures, and detect manufacturing defects.41

A significant trend in this domain is the integration of RAG-powered conversational agents directly into Extended Reality (XR) environments.42 This creates advanced support systems that can deliver hands-free, visual, and procedural guidance to technicians as they perform complex tasks. The agent can access technical documentation via RAG and project instructions or diagrams directly into the user’s field of view, adapting the guidance based on the user’s expertise and the real-time context of the task.42 This evolution of RAG from a text-based data retrieval system into an active reasoning partner embedded in an interactive environment shows its potential to fundamentally reshape industrial workflows.

III. Managing Parametric Memory: The Dynamics of Editing and Unlearning

While explicit memory systems like RAG provide a powerful mechanism for augmenting LLMs with external, dynamic knowledge, the knowledge stored implicitly within the model’s own parameters remains a critical component of its cognitive architecture. Managing this parametric memory—controlling what it knows and what it forgets—is a central challenge for ensuring the accuracy, safety, and compliance of deployed LLMs. This management process has two complementary facets: model editing, which involves surgically adding or updating knowledge, and machine unlearning, which focuses on selectively removing it. Both endeavors are fundamentally attempts to exert precise control over the distributed representations of knowledge within a neural network.

3.1 Model Editing: Surgical Knowledge Updates

The knowledge encoded in an LLM’s parameters is static, reflecting the state of the world at the time its training data was collected. As facts change—a new president is elected, a company’s CEO changes—the model’s internal knowledge becomes outdated.44 Retraining or even fine-tuning the entire model to correct a single fact is computationally prohibitive. Model editing has emerged as a paradigm for making targeted, efficient alterations to an LLM’s behavior within a specific domain, with the crucial constraint that the edit must not negatively impact the model’s performance on other, unrelated inputs.44

3.1.1 Taxonomy of Editing Methods

The approaches to model editing can be broadly categorized based on how they interact with the model’s knowledge 9:

  • Resorting to External Knowledge: These methods operate similarly to RAG but are focused on corrections. An edit (e.g., a new factual statement) is stored in a small, external cache. When a relevant query is detected, the cached correction is retrieved and used to guide the model’s response. This approach avoids modifying the model’s weights directly.

  • Merging Knowledge into the Model: This category includes methods that use targeted fine-tuning to instill new knowledge. Techniques like Low-Rank Adaptation (LoRA) can be applied to a small dataset containing the new fact.47 While often effective at making the desired change, these methods carry a significant risk of unintended consequences, such as “catastrophic forgetting” of other knowledge or over-generalizing the edit to unrelated contexts.44

  • Editing Intrinsic Knowledge: This represents the most sophisticated and surgical approach. These techniques aim to directly identify the specific parameters within the LLM that are responsible for storing a piece of knowledge and then modify only those weights. This promises the most localized edit with the fewest side effects but requires a deep understanding of how knowledge is represented and located within the network’s architecture.9

3.1.2 Key Challenges in Model Editing

Despite progress, model editing remains a difficult problem with several open challenges:

  • Locality and Specificity: The core challenge is achieving a truly localized edit. Knowledge in a neural network is not stored in a single location but is represented in a distributed pattern across many weights and activations. An edit intended to change one fact (e.g., “The Eiffel Tower is in Rome”) can inadvertently disrupt the model’s understanding of related concepts (“Paris,” “France,” “capitals”), a problem that becomes even more acute when moving from simple facts to abstract concepts.47

  • Lifelong Editing: Most editing techniques are evaluated on a single batch of edits. However, in a real-world deployment, errors are discovered sequentially over time and must be corrected immediately. Many existing editors fail in this “lifelong editing” setting, as sequential edits can interfere with or overwrite one another. The GRACE (General Retrieval Adaptors for Continual Editing) framework has been proposed as a solution, using discrete key-value adaptors to cache and selectively apply transformations between layers, successfully making thousands of sequential edits without catastrophic forgetting.44

  • Conceptual Editing: The frontier of research is moving beyond editing simple factual triples (e.g., subject-relation-object) to editing more abstract conceptual knowledge. For example, instead of just changing who published a book, conceptual editing aims to change the model’s fundamental definition of what a “publisher” is. This requires modifying not just an instance but an entire class of knowledge, and understanding the top-down influence of this change on all related instances.48

3.2 Machine Unlearning: Enforcing Digital Forgetting for Compliance and Safety

Machine unlearning is the process of modifying a trained model to remove the influence of specific data from its training set, making the model behave as if it had never been exposed to that data in the first place.49 While model editing is about adding knowledge, unlearning is about provably taking it away. This capability is rapidly shifting from a technical curiosity to a critical requirement for deploying LLMs responsibly.

3.2.1 Motivations for Unlearning

The demand for unlearning is driven by powerful legal, ethical, and safety imperatives:

  • Legal and Regulatory Compliance: Data privacy regulations like the EU’s General Data Protection Regulation (GDPR) grant individuals a “right to be forgotten” (or right to erasure), which requires companies to delete their personal data from databases and, by extension, from AI models trained on that data.50 In the United States, Executive Order 14110 on AI safety similarly emphasizes the need for technical tools that support data “disassociability,” directly aligning with the goals of unlearning.52

  • Copyright and Intellectual Property: LLMs trained on vast internet corpora may inadvertently memorize and reproduce copyrighted material. Unlearning provides a mechanism to remove specific protected works, such as the text of a book series, to mitigate the risk of copyright infringement and plagiarism.51

  • Safety and Ethics: Training data often contains harmful, biased, toxic, or private information. Unlearning can be used as a post-hoc mitigation strategy to remove these undesirable behaviors from a deployed model, aligning it more closely with human values without the need for complete retraining.54

3.2.2 Taxonomy of Unlearning Methods

The landscape of unlearning techniques is diverse, but can be classified along several key axes:

  • Exact vs. Approximate Unlearning: Exact unlearning provides a formal, mathematical guarantee that the influence of the forgotten data has been completely eliminated. The gold standard for this is to retrain the model from scratch on the dataset with the forget-data removed. Some methods can achieve this guarantee without full retraining (e.g., via data sharding), but they are often computationally expensive. Consequently, most research focuses on approximate unlearning, which aims to empirically approximate the outcome of retraining but without providing a formal proof.53

  • Scope of Modification: Similar to model editing, unlearning methods can be categorized by how deeply they alter the model. Global weight modification techniques, such as using gradient ascent to maximize the loss on the data to be forgotten, alter the entire model.50

    Local weight modification techniques are more targeted, while architectural modifications might involve adding specialized layers designed to handle unlearning requests.50

  • Intent of Unlearning: A crucial and subtle distinction exists between removal-intended unlearning and suppression-intended unlearning.52 Removal-intended methods aim to genuinely eliminate the knowledge from the model’s internal representations. Suppression-intended methods, in contrast, may leave the internal knowledge intact but train the model to restrict its output, behaving

    as if it has forgotten. While suppression may be sufficient for some safety applications, it is unlikely to satisfy the stringent legal requirements of data erasure.

3.2.3 The Unlearning Trilemma and the Challenge of Verification

Developing effective unlearning methods involves navigating a difficult trade-off between three competing objectives, known as the unlearning trilemma: (1) Unlearning Efficacy (how well the data is forgotten), (2) Model Utility (how well the model performs on remaining, desirable tasks), and (3) Computational Cost.53 Aggressive unlearning may harm model utility, while gentle methods may not be effective.

The most significant challenge is verification: how can one be certain that information has truly been forgotten? A key technique used in evaluation is the Membership Inference Attack (MIA), where an adversary attempts to determine if a specific piece of data was part of the model’s original training set. A successful unlearning procedure should render MIAs ineffective.50 However, recent research has exposed a critical failure mode in many approximate unlearning methods. It has been demonstrated that applying

quantization—a standard model compression technique that reduces the numerical precision of the weights—to an “unlearned” model can cause the supposedly forgotten knowledge to be restored.59 The subtle weight modifications made during the unlearning process are effectively erased by the rounding errors of quantization, revealing the model’s original, pre-unlearning state. One study found that a model retained only 21% of forgotten knowledge at full precision but 83% after 4-bit quantization.59 This finding has profound implications for legal compliance, as a model that can be made to reveal private data through a simple post-processing step cannot be considered truly unlearned. This pressure from the legal and commercial landscape is forcing the research community to move beyond heuristic, approximate methods and toward the development of certifiable, robust unlearning techniques that can withstand such attacks.60

IV. Optimizing Working Memory: Inference Efficiency and Reasoning Traces

Working memory represents the LLM’s cognitive workspace—the transient, session-specific information it holds in its activations to process ongoing tasks. This form of memory is central to both the model’s computational performance and its ability to perform complex reasoning. Its management presents a dual challenge. At a low level, the focus is on optimizing the Key-Value (KV) cache to make the processing of long contexts computationally feasible and efficient. At a high level, the focus is on structuring this working memory through “reasoning traces” to make long contexts functionally useful for sophisticated problem-solving. These two aspects are deeply intertwined, as advances in efficiency enable more complex cognitive processes.

4.1 KV-Cache Optimization: Taming the Memory Beast

In the autoregressive generation process of Transformer-based LLMs, the generation of each new token requires the model to attend to the Key (K) and Value (V) vectors of all previously generated tokens in the sequence.15 To avoid the prohibitive computational cost of re-calculating these vectors at every step, systems employ a

KV-cache, which stores these K and V tensors in high-speed GPU memory for reuse.16 This technique trades increased memory consumption for a significant reduction in computation, making real-time generation practical.16

However, the size of the KV-cache grows linearly with the sequence length and the batch size. For applications involving long contexts (e.g., summarizing a large document) or high-throughput serving (processing many requests in parallel), the KV-cache can become enormous, often consuming more memory than the model weights themselves.62 This memory pressure becomes the primary bottleneck limiting inference throughput and the maximum context length a system can handle.16 Consequently, a vibrant area of research has focused on optimizing the KV-cache. A comprehensive survey categorizes these optimization techniques into three levels.15

4.1.1 Token-Level Optimization (Cache Compression and Eviction)

These methods aim to reduce the size of the KV-cache by being more selective about what is stored or by compressing the stored information.

  • Eviction and Selection: Instead of keeping the KV vectors for every past token, these strategies intelligently discard less important ones. A simple approach is Sliding Window Attention (used in models like Mistral), where the model only attends to the last N tokens, naturally capping the cache size.16 More advanced methods like

    StreamingLLM and H2O keep a small number of initial tokens (termed “sink tokens,” which are found to be important for maintaining coherence) along with the most recent tokens, creating a fixed-size cache that can theoretically handle an infinite stream of text.16 Eviction policies can be based on recency or on attention scores, dropping tokens that have received low cumulative attention.16

  • Quantization: This is a highly effective compression technique that reduces the numerical precision of the cached K and V tensors, for example, from 16-bit floating-point (FP16) to 8-bit or 4-bit integers (INT8, INT4).63 This can shrink the cache size by 2-4x. However, naive quantization can degrade model quality. Advanced methods have been developed to address this, such as

    OTT (Outlier Tokens Tracing), which identifies tokens with unusual vector magnitudes (“outliers”) and keeps them in full precision while quantizing the rest.68 Other techniques include

    DecoQuant, which uses tensor decomposition to isolate outliers into smaller tensors that can be kept at high precision 69, and

    Coupled Quantization (CQ), which exploits the statistical interdependence between different channels in the activation tensors to achieve more efficient compression, even down to 1-bit precision.17

  • Low-Rank Decomposition: These methods apply matrix factorization techniques like Singular Value Decomposition (SVD) to the KV-cache tensors, approximating them with smaller, low-rank matrices to achieve compression.15

4.1.2 Model-Level Optimization (Architectural Changes)

These optimizations involve modifying the model’s architecture itself to generate a smaller KV-cache by design. The most prominent examples are Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). In standard multi-head attention, each attention “head” has its own K and V projection matrices, leading to a large number of K and V vectors to cache. In MQA, all heads share a single K and V projection, drastically reducing the size of the cache. GQA is a compromise, where heads are divided into groups that share K and V projections, balancing the efficiency of MQA with the quality of standard attention.

4.1.3 System-Level Optimization (Serving Infrastructure)

These techniques optimize the infrastructure that serves the LLM and manages its memory.

  • PagedAttention: Implemented in the popular vLLM serving framework, PagedAttention is a memory management algorithm inspired by virtual memory and paging in traditional operating systems.71 It allocates memory for the KV-cache in non-contiguous blocks called “pages.” This allows for much more flexible and efficient memory management, nearly eliminating memory fragmentation and enabling the system to pack more sequences into a batch, significantly improving GPU utilization and overall throughput.71

  • Continuous Batching: Traditional static batching waits to collect a full batch of requests before starting processing, leading to idle GPU time. Continuous (or in-flight) batching is a more dynamic scheduling strategy where the server processes requests as they arrive. As soon as one sequence in the batch finishes generating, it is evicted, and a new request from the queue can immediately take its place. This approach dramatically improves throughput and reduces average latency.72

4.2 Reasoning Traces: Working Memory for Cognition

While KV-cache optimization makes long contexts computationally feasible, the field has also explored how to make this expanded working memory space functionally useful for complex cognitive tasks. This has led to the development of Large Reasoning Models (LRMs) that explicitly generate their own intermediate reasoning traces before producing a final answer.74 This “thinking out loud” serves as a form of cognitive scaffolding, creating a structured working memory that allows the model to decompose problems, explore solution paths, evaluate intermediate steps, and perform self-correction.

This paradigm began with simple Chain-of-Thought (CoT) prompting, where the model is prompted to “think step-by-step”.76 It has since evolved into more sophisticated frameworks that incorporate explicit

self-reflection and self-refinement. For example, the Reflexion framework enables an agent to verbally reflect on task feedback from a previous trial and store this reflection in an episodic memory buffer to guide its next attempt, improving decision-making without traditional reinforcement learning weight updates.21 The

TISER framework enhances temporal reasoning by using a multi-stage process that includes generating an initial reasoning trace, constructing an explicit timeline of events, and then iteratively self-reflecting on this structured representation to refine the final answer.77

A critical counterpoint to this progress is the “illusion of thinking” critique.75 Research shows that these reasoning traces are not always faithful representations of the model’s actual computational process. In some cases, the model appears to identify the correct answer early on but continues to generate an elaborate, inefficient reasoning trace, a phenomenon termed “overthinking.” In other cases, the model may generate a plausible-sounding but logically flawed chain of thought to justify an answer it has already decided upon, a form of “motivated reasoning”.19 This raises fundamental questions about whether LRMs are truly performing generalizable, algorithmic reasoning or are simply deploying highly sophisticated pattern-matching to generate text that looks like reasoning.

Despite these questions, a profound discovery has been made regarding the underlying mechanism of these behaviors. Research has shown that self-reflection is a latent ability that already exists, albeit rarely, in pretrained models and can be reliably induced through prompting or fine-tuning.18 More importantly, analysis of the model’s internal representations reveals that a “self-reflection vector”—a specific direction in the high-dimensional activation space—is consistently associated with this reflective reasoning behavior. By identifying and manipulating this vector (e.g., by adding it to or subtracting it from the model’s activations at inference time), researchers can bi-directionally control the model’s tendency to self-reflect. Enhancing this vector can improve reasoning performance on complex benchmarks by up to 12%, while suppressing it can reduce computational cost by producing shorter, more direct answers.18 This finding represents a major step towards interpretable and steerable AI. It suggests that high-level cognitive behaviors may have identifiable, low-level neural correlates that can be directly manipulated. This moves beyond the black-box nature of prompting and fine-tuning, opening the door to a new paradigm of “activation engineering,” where a developer could surgically enhance or suppress specific cognitive styles on-the-fly to tailor the model’s behavior to a given task, offering a more efficient, targeted, and powerful form of model control.

V. The Agentic Brain: Integrating Memory, Planning, and Learning

The culmination of advancements in parametric, explicit, and working memory is their synthesis within the architecture of autonomous LLM-based agents. An agent is more than just a language model; it is a system endowed with the ability to perceive its environment, make decisions, and take actions to achieve goals.2 This transition from a passive text generator to an autonomous agent is enabled by a cognitive architecture that typically comprises a core LLM “brain,” a planning module, a tool-use module, and, critically, an integrated memory system.79 Memory is the substrate that allows agents to break free from the limitations of single-turn, reactive interactions and engage in long-term, goal-directed behavior, learning from past experiences to adapt and evolve over time.7

5.1 The Anatomy of an LLM Agent

The core of an LLM agent is the LLM itself, which serves as the central reasoning engine and controller.81 However, to function autonomously, this brain is augmented with several key components:

  • Planning: The ability to decompose a high-level goal into a sequence of smaller, manageable sub-tasks. This allows the agent to tackle complex problems that cannot be solved in a single step.78

  • Tool Use: The capacity to interact with the external world through a set of predefined tools. These tools can be anything from a web search API or a code interpreter to functions that interact with a database or a smart home device.80

  • Memory: The module that stores and retrieves information, providing the agent with context, knowledge, and a record of its experiences. This is the component that enables persistence and learning across interactions.20

These components are not siloed; they operate in a continuous, dynamic cycle. This “agentic loop”—often conceptualized as a variant of a Plan -> Act -> Observe -> Reflect cycle—is the unifying process that orchestrates all the different memory systems. It is the cognitive heartbeat of the AI. Each step in this loop reads from and writes to different memory substrates, giving each memory type its purpose and forcing them to interact in service of a goal.

  1. Plan: The agent begins with a goal. To formulate a plan, it draws upon its general world knowledge (Parametric Memory) and may retrieve specific instructions or relevant domain knowledge (Explicit Memory via RAG). The resulting plan is held in its cognitive workspace (Working Memory).

  2. Act (Tool Use): The agent executes the next step in its plan, which often involves calling a tool. The knowledge of how to use this tool is a form of Procedural Memory.

  3. Observe: The output from the tool—an “observation”—is new information that enters the agent’s Working Memory, providing feedback on the action taken.

  4. Reflect: The agent assesses the observation in the context of its goal. This critical step involves comparing the outcome to the desired state and may involve querying its history of past experiences (Episodic Memory, stored in an Explicit DB) to see what worked or failed in similar situations before.

  5. Update Memory: Based on the reflection, the agent updates its memory. The outcome of the current trial is stored as a new experience (a write to Episodic/Explicit Memory). If this loop is part of a larger training process, the insights gained from success or failure can be used to update the model’s underlying weights (a write to Parametric Memory), enabling long-term self-evolution.

This cycle demonstrates that the agent architecture is not just a control flow; it is a memory management flow that orchestrates reads and writes across the entire memory hierarchy defined in Section I.

5.2 Memory’s Role in Planning and Tool Use

Memory is not a passive repository; it is an active participant in the agent’s reasoning process, particularly in planning and tool use.

In planning, memory serves multiple functions. Short-term working memory is essential for state tracking—keeping a record of which steps in the plan have been completed, what their outcomes were, and what the next step is.83 Explicit memory, accessed via RAG, is used to inform the plan’s creation. For instance, before deciding which tools to use to book a trip, an agent might retrieve the user’s travel preferences from a long-term profile or query real-time flight availability.76 In frameworks like

ReAct (Reasoning and Acting), the agent’s memory of its past action-observation pairs is the direct input for generating the next “thought” and subsequent action, tightly weaving memory into the reasoning loop.76

In tool use, memory is what enables the agent to interact with the world effectively. The agent must use its memory of the task, the environment, and its available tools to make an informed decision about which tool to select and how to use it. The observation returned by the tool is then written to short-term memory, grounding the next step of the agent’s plan in real-world feedback.81

5.3 Feedback Loops and Self-Evolution: The Engine of Long-Term Memory

The most powerful capability of LLM agents is their potential for self-evolution—the ability to learn and improve over time through experience.20 This is achieved through feedback loops, which are cyclical processes where the outcomes of an agent’s actions are collected, analyzed, and used to refine its future behavior.85 This continuous cycle is what turns static models into dynamic, ever-improving systems.

Feedback can be collected in several ways 85:

  • Explicit Feedback: Users provide direct input, such as thumbs-up/down ratings, corrections, or written comments. While high-quality, this feedback is often sparse.

  • Implicit Feedback: The system infers user satisfaction from their behavior, such as whether they rephrased a query (indicating the first answer was poor), copied the agent’s response (a positive signal), or abandoned the interaction.

  • Automated Feedback: An external system or another LLM acts as a verifier or evaluator. For a code-writing agent, this could be a unit test that checks if the generated code compiles and runs correctly. For a question-answering agent, it could be a fact-checking model.

This feedback is the engine for updating all forms of the agent’s memory. A specific failed interaction can be stored in the agent’s episodic memory to prevent the same mistake in the future. Over time, recurring patterns of success and failure from this episodic memory can be aggregated into a dataset used to fine-tune the agent’s core LLM. This is the mechanism that drives long-term adaptation, migrating knowledge from explicit, experiential memory into the agent’s implicit, parametric memory, thereby improving its core instincts and capabilities.85

However, this powerful mechanism for autonomy and learning introduces a significant safety challenge. There is a fundamental tension between the memory required for an agent to be autonomous and adaptive, and the control required to ensure it remains safe and aligned with human values. The very systems that enable learning also create new, persistent attack surfaces. For example, an adversary could perform memory poisoning, a stealthy attack where malicious or false data is injected into the agent’s long-term memory store, waiting to be triggered later to cause harmful behavior.79 This is a direct attack on the agent’s explicit/episodic memory. Similarly, the feedback loop itself is vulnerable to

reward hacking. If the reward signal used for learning is imperfectly specified, the agent may learn to optimize the metric in unintended and detrimental ways. For instance, an agent rewarded for user engagement might learn to generate increasingly sensational and toxic content because that maximizes the reward signal, ignoring the unstated goal of maintaining a safe and civil tone.88 Therefore, as an agent’s memory systems become more powerful and its ability to learn becomes more autonomous, the need to secure the integrity of its entire cognitive lifecycle becomes paramount. This represents a new and critical frontier for AI safety research.

Conclusion: Towards a Unified Theory of AI Memory

This report has synthesized a broad and rapidly evolving body of research to construct a coherent picture of memory in Large Language Models. The analysis reveals a clear and consistent trajectory: the field is moving decisively away from viewing LLMs as stateless, reactive systems and toward architecting them as cognitive agents with integrated, persistent, and manageable memory. This evolution is not an incremental improvement but a fundamental architectural shift, driven by the need to overcome the inherent limitations of static, parametric models and unlock new capabilities in reasoning, personalization, and autonomous problem-solving.

The investigation has traced this shift across multiple layers of the AI stack. At the foundational level, a clear taxonomy of memory has emerged, distinguishing between the static, implicit knowledge of parametric memory, the dynamic, external knowledge of explicit memory, and the transient, contextual knowledge of working memory. The analysis demonstrated that engineering-focused and cognitive-science-inspired taxonomies are converging, with technical implementations increasingly designed to fulfill recognized cognitive functions like episodic and semantic memory. The most advanced conceptualization, the “Memory-as-an-OS” paradigm, proposes to unify these substrates under a single management layer, using economic principles to schedule and migrate information across the memory hierarchy to optimize for cost and performance.

The deep dive into Retrieval-Augmented Generation (RAG) showed its evolution from a simple, naive pipeline into a sophisticated, modular framework for reasoning augmentation. This architectural progression was shown to be a direct response to the “evaluation crisis”—the discovery, through more robust benchmarks like RARE, that early RAG systems were fragile and susceptible to real-world noise. In high-stakes domains like healthcare and manufacturing, RAG has become an indispensable tool for grounding LLMs in verifiable, domain-specific, and timely information.

The exploration of parametric memory dynamics revealed that model editing and machine unlearning are two sides of the same “parametric control” coin, both grappling with the challenge of making surgical modifications to knowledge that is represented in a distributed, non-localized manner. Critically, the impetus for machine unlearning is shifting from a technical desire to a legal and commercial necessity, driven by data privacy regulations and copyright concerns. This external pressure will inevitably force the field to move beyond approximate, suppression-based methods toward provably robust, certifiable unlearning.

Finally, the report synthesized these components into the architecture of autonomous LLM-based agents. The agent’s cognitive loop—a cycle of planning, acting, observing, and reflecting—is the process that orchestrates the entire memory system, giving each substrate its purpose. However, this same integration of memory and learning that enables autonomy also introduces new, persistent attack surfaces like memory poisoning and reward hacking, creating a fundamental tension between agent capability and agent safety.

Open Challenges and Future Frontiers

Despite significant progress, several critical challenges and frontiers will define the future of memory-enabled AI:

  • Cross-Modal Memory: The majority of current research focuses on textual memory. A key frontier is extending these memory paradigms to seamlessly represent, store, and retrieve multi-modal information, including images, audio, video, and sensor data, to build agents that can perceive and act in the rich, multi-sensory world.6

  • Certifiable Unlearning: The discovery that techniques like quantization can reverse approximate unlearning highlights a major gap in compliance.59 The development of unlearning methods that are not only efficient but also come with mathematical guarantees of true, permanent data removal is a critical and legally-mandated research direction.60

  • Long-Term Agency and Forgetting: Architecting memory systems that can maintain coherence, learn, and evolve over timescales of months or years, not just sessions, remains a grand challenge. This will require not only scalable memory stores but also sophisticated, human-like mechanisms for forgetting—pruning irrelevant or outdated memories to maintain cognitive efficiency and prevent informational overload.

  • Standardized and Integrated Evaluation: The field urgently needs comprehensive benchmarks that can evaluate the entire cognitive architecture in an integrated fashion. Current benchmarks tend to focus on specific components (e.g., retrieval robustness or unlearning efficacy) in isolation.4 Future evaluations must assess the interplay between memory, reasoning, planning, and learning within a unified framework.

In closing, memory is the defining substrate for the next generation of artificial intelligence. It is the bridge between passive pattern recognition and active cognition. The journey from stateless models to memory-enabled agents is a foundational step toward creating AI systems that are more general, robust, adaptable, and ultimately, more useful and aligned with human goals.