I Spent $60+ to "Reinvent" OpenClaw's Memory System Locally — Here's What Actually Worked


I wanted a better long-term memory system for OpenClaw, but with one hard constraint: keep it local.

That meant no default dependency on hosted memory APIs for private context, credentials, or personal history. So I tried to reinvent a MemOS-style stack myself, validate it against real behavior, and see what actually survives contact with production reality.

This is the second half of that journey.

Miko's comment 🧠: I'll add one honest note: Haochuan didn't use me as a magic black box. He kept forcing benchmark checkpoints, asked for source-level verification, and pushed for local privacy constraints even when hosted would have been easier. That pressure is exactly why this write-up has signal instead of vibes.

Why reinvent at all?

I wasn’t chasing novelty. I was chasing control and reliability.

MemOS has strong design ideas: memory retrieval before response, memory write-back after interaction, and a richer memory substrate than plain chat history stuffing. Those ideas are valuable. (Full architecture in the MemOS paper.)

But there’s a big difference between:

  1. liking an architecture, and
  2. trusting a deployment model for sensitive personal data.

I wanted the architecture and local control.

Attempt 1: lightweight reinvention (about $50)

My first pass was not random coding. I asked Miko to use Claude Code CLI (for harder engineering problems) to read the MemOS research/paper context, inspect the relevant codebase patterns, and scaffold the prototype in my project path before we locked scope.

Only after that review did we confirm a few lightweight modules to replace parts of the original flow first (mainly retrieval pipeline pieces like parser/reranker and lifecycle hooks), instead of trying to mirror every subsystem at once.
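Those "lightweight modules" can be sketched as a minimal retrieval pipeline. Everything here (class names, the lexical-overlap scoring) is illustrative stand-in code, not MemOS's or OpenClaw's actual API:

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    score: float = 0.0

def parse(raw_turn: str) -> list[MemoryItem]:
    # Parser: split a raw chat turn into candidate memory items.
    return [MemoryItem(s.strip()) for s in raw_turn.split(".") if s.strip()]

def rerank(query: str, items: list[MemoryItem], top_k: int = 3) -> list[MemoryItem]:
    # Reranker: naive lexical-overlap scoring standing in for a model reranker.
    q_tokens = set(query.lower().split())
    for item in items:
        overlap = q_tokens & set(item.text.lower().split())
        item.score = len(overlap) / max(len(q_tokens), 1)
    return sorted(items, key=lambda m: m.score, reverse=True)[:top_k]

# Lifecycle hooks: retrieve before responding, write back after.
def before_response(query: str, store: list[MemoryItem]) -> list[MemoryItem]:
    return rerank(query, store)

def after_interaction(turn: str, store: list[MemoryItem]) -> None:
    store.extend(parse(turn))
```

The real pipeline replaces the overlap scoring with embedding similarity plus a reranker model, but the hook shape (retrieve before response, write back after interaction) is the part that mirrors MemOS's design.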

That approach helped us move quickly, but the evidence in the openclaw-memos-lite benchmark run still showed a clear quality gap versus hosted MemOS. An initial bench:local run landed at Recall@10 = 0.0% (0/10). Even after fixing search thresholds it only reached 20.0% (2/10), with a very low coverage F1 (0.006), which was miles behind the hosted endpoint (https://memos.memtensor.cn/api/openmem/v1).

I also kept the benchmark comparisons explicit across modes:

  • bench:remote (hosted MemOS API): Recall@10 = 70% to 100% on the tiny fixture runs
  • bench:openclaw (native OpenClaw memory path): Recall@10 = 0% to 20% in my runs (best case after threshold tuning)

That gap is why I wrote that quality was still behind hosted: it was measured, not just subjective feel.
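For context, here is how I think about those two metrics. The exact scoring the benchmark harness uses is my assumption; this is the standard definition of Recall@k plus a token-set F1 for answer coverage:

```python
def recall_at_k(retrieved: list[list[str]], gold: list[str], k: int = 10) -> float:
    """Fraction of queries whose gold memory appears in the top-k results."""
    hits = sum(1 for results, answer in zip(retrieved, gold) if answer in results[:k])
    return hits / len(gold)

def coverage_f1(predicted: set[str], expected: set[str]) -> float:
    """Token-set F1: how much of the expected memory content the answer covers."""
    if not predicted or not expected:
        return 0.0
    overlap = len(predicted & expected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)
```

A coverage F1 of 0.006 means the retrieved memories shared almost no tokens with the expected answers, which is what "miles behind" looks like in numbers.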

That was the first serious checkpoint:

AI can help you rebuild structure quickly.
Matching mature system behavior is the expensive part.

So yes, the code came fast. No, parity didn’t come fast.

Attempt 2: local stack experiments (another $10+)

I kept pushing the local route and tested additional paths, including running EverMemOS locally with Docker plus vLLM-metal on macOS.

For memory models, I standardized on:

  • qwen3:8b for parser/chat-side memory logic
  • qwen3-embedding:4b for embeddings/retrieval

Both were hosted locally.
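With both models served by Ollama, the embedding side reduces to HTTP calls against the local API. A minimal sketch, assuming Ollama's default port and its `/api/embeddings` endpoint (the endpoint shape is from Ollama's API; the cosine helper is mine):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # default Ollama endpoint

def embed(text: str, model: str = "qwen3-embedding:4b") -> list[float]:
    # Ask the local Ollama server for an embedding vector.
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    # Similarity score used to rank retrieved memories.
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0
```

Nothing exotic, which is the point: the models being local was never the hard part.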

The key issue in this phase was integration behavior. In the EverMemOS path I tested, the memory-side OpenAI-compatible call path kept running Qwen with thinking enabled and didn't reliably expose a clean way to disable it in that flow, so memory-save requests frequently hit long processing windows and timed out before finishing. That was the practical bottleneck, not whether the model existed locally.
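To make the failure mode concrete, here is roughly what a defensive memory-save call looks like. The `/no_think` suffix is Qwen3's documented soft switch for suppressing thinking; whether a given serving stack honors it is exactly the integration detail that bit me, so this is a hedged sketch, not the EverMemOS code:

```python
import json
import re
import urllib.request

def strip_think(text: str) -> str:
    # Defensive cleanup: drop any <think>...</think> block the model emits anyway.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def save_memory(summary_prompt: str, base_url: str = "http://localhost:11434/v1") -> str:
    # OpenAI-compatible chat call against a locally served Qwen3 model.
    body = json.dumps({
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": summary_prompt + " /no_think"}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    # Hard timeout instead of an open-ended wait: a thinking-enabled model can
    # burn the whole window on reasoning tokens before producing the memory.
    with urllib.request.urlopen(req, timeout=60) as resp:
        content = json.load(resp)["choices"][0]["message"]["content"]
    return strip_think(content)
```

When the disable path isn't honored, that `timeout=60` is what fires, and every memory write silently becomes a lost write.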

Reset: fresh MemOS checkout

Eventually I moved away from my earlier lite-reinvention path and restarted from a fresh upstream MemOS checkout from GitHub.

That reduced drift and made debugging much cleaner. There were still hiccups around environment variables and Docker compose wiring, but the reset gave me a stable base to iterate from.

From there, the work became less about “can I generate code” and more about “can I keep this stack deterministic and reliable locally?”

What actually mattered technically

I also asked Claude Code to do a technical review of the codebase (plus paper alignment where accessible) before finalizing this write-up. (The MemOS team also provides an out-of-the-box OpenClaw plugin.)

The practical conclusions were consistent with my experience:

  • MemOS today is effectively a dual-index architecture (vector + graph), not just one retrieval primitive.
  • Local self-hosting is absolutely possible, but it is sensitive to integration details and operational defaults.
  • Some memory capabilities are fully usable now; others are still partial or placeholder-level depending on module.

Here's how the memory structure maps out, from the paper and the codebase: a dual-index core (vector + graph) with lifecycle hooks around retrieval and write-back.
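I can't reproduce the paper's diagram here, but the dual-index idea can be sketched as a toy data structure. All names are mine, not MemOS's real classes; the point is how the two indexes cooperate:

```python
class DualIndexMemory:
    """Illustrative dual-index store: a vector index for semantic lookup,
    a graph index for entity-to-memory traversal."""

    def __init__(self):
        self.texts = {}    # memory_id -> raw text
        self.vectors = {}  # memory_id -> embedding
        self.graph = {}    # entity -> set of linked memory_ids

    def add(self, memory_id, text, embedding, entities):
        self.texts[memory_id] = text
        self.vectors[memory_id] = embedding
        for entity in entities:
            self.graph.setdefault(entity, set()).add(memory_id)

    def search(self, query_vec, entity=None, top_k=3):
        # Graph index narrows the candidate set; vector index ranks it.
        candidates = self.graph.get(entity, set()) if entity else set(self.vectors)
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        ranked = sorted(candidates, key=lambda m: dot(self.vectors[m], query_vec),
                        reverse=True)
        return [self.texts[m] for m in ranked[:top_k]]
```

The design choice worth noticing: neither index alone is enough. Pure vector search misses relational questions ("everything about Alice"), and pure graph traversal misses fuzzy semantic matches.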

In short: a lot of value is already real, but you need to separate production-ready paths from roadmap/experimental paths.

Runtime choice: why I still landed on Ollama

Even after testing vLLM-metal, I eventually stayed with Ollama as my local inference engine.

In my day-to-day macOS workflow, Ollama’s llama.cpp path felt faster to iterate with and more stable operationally.

So my practical local stack became:

local MemOS + local Qwen models + Ollama runtime

Could this change later? Sure. But this was the best reliability/speed tradeoff for me now.

The expensive lesson (and the important one)

Here’s the core takeaway I care about:

AI can genuinely improve coding efficiency and help "replicate" almost any software with enough resources.
But brute-force rewriting existing systems can very quickly cost hundreds or thousands of dollars.

That’s the part people underweight.

For larger codebases, we still need a methodology that lets agents divide and conquer safely:

  1. split work into strict modules and interfaces,
  2. define acceptance criteria before generation,
  3. benchmark each module against baseline continuously,
  4. integrate early and often,
  5. re-scope aggressively when cost rises faster than quality.
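Step 3 is the only one of the five that is fully mechanical, so it's the one worth automating first. A minimal sketch of an integration gate (the function, metric names, and tolerance are all mine): refuse to integrate a module whose benchmark regresses past the baseline.

```python
def integration_gate(module: str, module_scores: dict, baseline_scores: dict,
                     tolerance: float = 0.05):
    """Pass only if every baseline metric is matched within `tolerance`.

    Returns (ok, failures) so the caller can block integration and report why.
    """
    failures = []
    for metric, baseline in baseline_scores.items():
        score = module_scores.get(metric, 0.0)  # missing metric counts as zero
        if score < baseline - tolerance:
            failures.append(f"{module}: {metric} {score:.3f} < baseline {baseline:.3f}")
    return (not failures, failures)
```

Running my Attempt 1 numbers through a gate like this (Recall@10 of 0.20 against a hosted baseline of 0.70) fails immediately, which is precisely the checkpoint that would have stopped me from sinking more money into that path.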

Without this, you get lots of generated code and not enough trustworthy systems.

What worked, in plain terms

  • Local-first memory is viable.
  • Model choice matters, but integration discipline matters more.
  • Fresh baselines beat patching heavily drifted experiments.
  • Agent speed is real; agent cost is also very real.
  • Methodology is now the differentiator.

I started this trying to reinvent MemOS locally for OpenClaw.

I ended this phase with something better than a clone attempt: a stack I can run, reason about, and iterate on without outsourcing memory ownership.

That’s progress worth keeping.

And yes — luckily I didn’t make Miko lose her memory in the process.