MemRL Self-Evolving Agents: Episodic Memory RL Guide 2026

Date: 2026-05-14
Track: paper-poc
Paper: arXiv:2601.03192 (v1: 2026-01-06, v2: 2026-02-12)
Repo: https://github.com/MemTensor/MemRL
Environment: macOS Darwin 24.6.0, Python 3 (system)

What Was Reproduced

A minimal sandbox PoC of MemRL's core Intent-Experience-Utility (IEU) triplet architecture with Two-Phase Retrieval. Full repo installation was not attempted (requires conda env, ALFWorld, BigCodeBench dependencies). The goal was to demonstrate the mechanism conceptually in pure Python without external dependencies.

PoC Code

import math

class SimpleMemRL:
    def __init__(self, top_k_semantic=5, top_k_q=2):
        self.memory = []
        self.top_k_semantic = top_k_semantic
        self.top_k_q = top_k_q

    def _cosine_sim(self, a, b):
        # Cosine similarity over binary word sets: shared words / sqrt(|A| * |B|).
        # Sandbox stand-in only; real MemRL uses dense embeddings.
        set_a = set(a.lower().split())
        set_b = set(b.lower().split())
        if not set_a or not set_b:
            return 0.0
        return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

    def write(self, intent, experience, initial_q=0.5):
        self.memory.append({'intent': intent, 'experience': experience, 'q': initial_q})

    def update_q(self, intent, reward):
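        # Monte Carlo-style Q update: move Q a fraction alpha of the way toward
        # the observed reward, for every memory whose intent matches exactly.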
        alpha = 0.3
        for m in self.memory:
            if m['intent'] == intent:
                m['q'] += alpha * (reward - m['q'])

    def retrieve(self, query_intent):
        if not self.memory:
            return []
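        # Phase 1: rank all memories by similarity to the query and keep the
        # top_k_semantic best (no threshold, so zero-overlap entries can pass).
        scored = [(self._cosine_sim(query_intent, m['intent']), m) for m in self.memory]
        scored.sort(key=lambda x: x[0], reverse=True)
        candidates = [m for _, m in scored[:self.top_k_semantic]]
        # Phase 2: re-rank the Phase 1 candidates by learned Q-value.
        candidates.sort(key=lambda m: m['q'], reverse=True)
        return candidates[:self.top_k_q]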

Commands Run

python3 -c "... (inline PoC script as above)"

No package installs needed. Pure Python 3, no external dependencies.

Output

=== MemRL PoC: Intent-Experience-Utility Triplet ===
  [WRITE] Stored: sort a list of integers
  [WRITE] Stored: sort a list descending
  [WRITE] Stored: fix IndexError in Python
  [WRITE] Stored: optimize slow loop

--- After positive feedback on sort strategy ---
  [UPDATE] Q updated for: sort a list of integers -> Q=0.580
  [UPDATE] Q updated for: sort a list descending -> Q=0.620

--- Retrieve for new query: sort items in reverse order ---
  Retrieved [Q=0.620]: sort a list descending -> Use sorted(lst, reverse=True)
  Retrieved [Q=0.600]: optimize slow loop -> Use list comprehension or numpy vectorization

--- Negative feedback on fix strategy ---
  [UPDATE] Q updated for: fix IndexError in Python -> Q=0.240

--- Retrieve for: debugging IndexError in code ---
  Retrieved [Q=0.620]: sort a list descending -> Use sorted(lst, reverse=True)
  Retrieved [Q=0.600]: optimize slow loop -> Use list comprehension or numpy vectorization

=== PoC Complete: Q-value learning changes retrieval priority ===
Exit code: 0

What Worked

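The Two-Phase Retrieval control flow reproduced cleanly: semantic filter first, Q-value ranking second. Note that with four stored memories and the default top_k_semantic=5, Phase 1 excluded nothing in this run, so the results above reflect Q-value ranking alone.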
Monte Carlo Q-update (alpha=0.3) demonstrates how positive/negative feedback reshapes retrieval priority without touching model weights.
The IEU triplet structure maps directly onto the paper's formalization.
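
The update rule in update_q can be checked by hand. The sketch below restates that one line as a standalone function. The helper name q_update is ours, not from the MemRL repo, and the rewards 1.0 and 0.0 are illustrative assumptions, since the rewards used in the run above are not recorded.

```python
def q_update(q, reward, alpha=0.3):
    """Move q a fraction alpha of the way toward the observed reward."""
    return q + alpha * (reward - q)

# Starting from the PoC's default initial_q of 0.5:
after_success = q_update(0.5, reward=1.0)  # 0.5 + 0.3 * (1.0 - 0.5)
after_failure = q_update(0.5, reward=0.0)  # 0.5 + 0.3 * (0.0 - 0.5)

print(f"{after_success:.3f}")  # 0.650
print(f"{after_failure:.3f}")  # 0.350
```

Each update closes only 30% of the gap between the current Q and the reward, so a single bad outcome lowers an entry's rank without erasing what earlier feedback established.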

Limitations Observed

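Naive similarity vs. real embeddings: The word-overlap score (cosine similarity over word sets, not Jaccard) does not capture semantic meaning, so paraphrases with no shared words score 0. That is not what broke the last retrieval, though. For the query "debugging IndexError in code", the entry "fix IndexError in Python" shares two of four words and scores 0.5, while the other three entries score 0. It was still dropped because the default top_k_semantic=5 exceeds the four stored memories, so Phase 1 passed every entry and Phase 2 ranked purely by Q-value, where that entry's Q of 0.240 placed it last. Real MemRL uses dense embeddings (e.g., text-embedding-ada-002 or similar) for Phase 1; in this PoC, a smaller top_k_semantic or a similarity threshold would also be needed for the filter to have any effect.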
No MDP formalization: The full MemRL formalizes the LLM-memory interaction as an MDP with states (task context), actions (memory reads/writes), and rewards (task success). This PoC simplified the feedback loop to direct Q-updates.
No temporal decay: The paper's ALFWorld experiments show MemRL handles sequential multi-step tasks where memory relevance evolves across time. This PoC has no notion of time steps or episodes: memories never age, and Q-values change only through explicit update_q calls.
Full repo not installed: Installing the actual MemTensor/MemRL repo requires ALFWorld environment setup and LLM API credentials. Not done in this sandbox session.
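
To see how the two phases interact on the last query, the sketch below recomputes the Phase 1 scores with the same word-overlap formula as the PoC and then runs both phases for two values of top_k_semantic. The intents and Q-values are copied from the Output section; the function names overlap_cosine and two_phase are ours, and this is a diagnostic sketch rather than code from the MemRL repo.

```python
import math

def overlap_cosine(a, b):
    """Cosine similarity over lowercase word sets, as in the PoC's _cosine_sim."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

# (intent, Q-value) pairs copied from the Output section.
memory = [
    ("sort a list of integers", 0.580),
    ("sort a list descending", 0.620),
    ("fix IndexError in Python", 0.240),
    ("optimize slow loop", 0.600),
]

def two_phase(query, top_k_semantic, top_k_q=2):
    # Phase 1: keep the top_k_semantic most similar intents.
    by_sim = sorted(memory, key=lambda m: overlap_cosine(query, m[0]), reverse=True)
    candidates = by_sim[:top_k_semantic]
    # Phase 2: re-rank the survivors by Q-value.
    return sorted(candidates, key=lambda m: m[1], reverse=True)[:top_k_q]

query = "debugging IndexError in code"
for intent, q in memory:
    print(f"sim={overlap_cosine(query, intent):.2f}  Q={q:.3f}  {intent}")

print(two_phase(query, top_k_semantic=5))  # Phase 1 keeps all four entries
print(two_phase(query, top_k_semantic=1))  # Phase 1 keeps only the best match
```

The IndexError entry scores 0.50 and the other three score 0.00. With top_k_semantic=5 the result matches the run above (sort descending, then optimize slow loop); with top_k_semantic=1 only the IndexError entry survives Phase 1 and is returned despite its low Q-value. Whether a low-Q match should be returned at all is a separate design question that this sketch does not settle.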

Key Takeaway

The core IEU triplet and Two-Phase Retrieval mechanism are simple to implement, but they need dense embeddings and a real task loop to produce meaningful results. The mechanism's value emerges over multiple episodes as Q-values diverge between high-utility and low-utility experiences, a property this PoC demonstrates in miniature.
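
The divergence over episodes can be illustrated with a toy simulation. The setup is an assumption of ours, not taken from the paper: two entries start at the PoC's default Q of 0.5, one receives reward 1.0 every episode and the other 0.0, and alpha stays at 0.3. Real task rewards would be noisier.

```python
ALPHA = 0.3  # same step size as the PoC's update_q

def simulate(reward, episodes, q0=0.5):
    """Apply the PoC's update rule once per episode and return the Q trajectory."""
    q, trajectory = q0, [q0]
    for _ in range(episodes):
        q += ALPHA * (reward - q)
        trajectory.append(q)
    return trajectory

helpful = simulate(reward=1.0, episodes=5)    # hypothetical high-utility entry
unhelpful = simulate(reward=0.0, episodes=5)  # hypothetical low-utility entry

for episode, (h, u) in enumerate(zip(helpful, unhelpful)):
    print(f"episode {episode}: helpful={h:.3f} unhelpful={u:.3f} gap={h - u:.3f}")
```

Under a constant reward r, the rule has the closed form Q_n = r + (Q_0 - r)(1 - alpha)^n, so in this setup the gap between the two entries is 1 - 0.7^n: zero at the start and about 0.83 after five episodes.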

Sources

Paper: https://arxiv.org/abs/2601.03192
Full HTML: https://arxiv.org/html/2601.03192v1
GitHub (MemTensor/MemRL): https://github.com/MemTensor/MemRL
VentureBeat coverage: https://venturebeat.com/technology/memrl-outperforms-rag-on-complex-agent-benchmarks-without-fine-tuning