Building an Agent That Can Grow: From Prompting to Self-Transformation


Most people still use AI agents as chat interfaces with better autocomplete.

I've been moving in a different direction: building an assistant that can self-evolve through better skills, better memory, and better tool architecture.

Today was one of those days where the pattern clicked. I shipped four things that all point to the same goal: make my agent more capable over time without turning the system into an unmaintainable context blob.

Why I stopped relying on prompts alone

Prompting gets quick wins. But as your workflows grow, prompt-only systems start to crack:

  • context gets bloated,
  • failures become harder to debug,
  • and "intelligence" often hides brittle infrastructure.

The technical root of this problem is that a flat system prompt is a single namespace. Every instruction, constraint, persona detail, and workflow lives concatenated in one string. There's no encapsulation, no lazy loading, no separation of concerns. As the prompt grows, you hit attention dilution — the model starts discounting later instructions — and you burn token budget before you've even loaded relevant context.

So instead of squeezing more instructions into one giant prompt, I've been building modular capabilities:

  • reusable skills,
  • explicit system introspection,
  • stronger retrieval quality,
  • and dynamic tool routing.

This is what "agentic" means to me in practice: compounding capability with control.

Skill Thief (In a Good Way)

~/.openclaw/skills/claude-skill-scout

This one is simple and powerful: learn from strong communities instead of reinventing everything from scratch.

Anthropic's ecosystem has great skill patterns and standards. claude-skill-scout helps me discover and adapt those patterns into my own OpenClaw setup.

Not copy-paste. Structured borrowing.

  • detect useful skill design patterns,
  • separate transferable mechanism from style,
  • and map them into my local runtime constraints.

It's basically a learning pipeline for my assistant. If a better pattern exists somewhere, I want to absorb it fast.

Under the hood: Skills in Claude Code are Markdown files — with optional YAML frontmatter — stored in ~/.claude/skills/ or inside plugin directories. At invocation time, the plugin framework injects the skill content into the system prompt. That means the skill authoring problem is essentially a prompt engineering problem with structure: what goes in frontmatter (trigger conditions, tool allowlists, model hints), what goes in the body (instruction logic), and how to write for progressive disclosure so early sections give the model enough context to self-route correctly.
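To make that concrete, here's a minimal sketch of what such a skill file can look like. The skill itself is hypothetical, and the frontmatter keys shown (name, description, allowed-tools) follow the commonly documented Claude Code convention; exact supported keys depend on your version.

```markdown
---
name: summarize-changelog
description: Turn a project's CHANGELOG into user-facing release notes. Use when the user asks for release notes or a changelog digest.
allowed-tools: Read, Grep
---

# Summarize Changelog

1. Locate the CHANGELOG file in the repository root.
2. Extract entries added since the last tagged release.
3. Group changes by type: features, fixes, breaking changes.
4. Write a short, user-facing summary in Markdown.
```

Note how the description doubles as the trigger condition: it tells the model when to self-route into this skill before the body is ever loaded.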

claude-skill-scout makes the extraction systematic. When it finds a skill worth borrowing, it parses three layers: the trigger semantics (when does this skill fire?), the mechanism (what reasoning or action pattern does it encode?), and the style (prose conventions, formatting choices). Only the first two transfer. The third gets replaced to match my local conventions and tone.
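A rough sketch of that three-layer split, in Python. This is illustrative, not the actual scout code: the heuristics here (frontmatter description as trigger, list items as mechanism, everything else as style) are simplified stand-ins for what the real extraction does.

```python
import re

def split_skill(text: str) -> dict:
    """Split a Markdown skill file into the three layers claude-skill-scout
    distinguishes: trigger semantics, mechanism, and style.
    (Hypothetical sketch; field names and heuristics are illustrative.)"""
    m = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    frontmatter, body = (m.group(1), m.group(2)) if m else ("", text)

    # Trigger semantics: when should this skill fire? Lives in frontmatter.
    trigger = ""
    for line in frontmatter.splitlines():
        if line.startswith("description:"):
            trigger = line.split(":", 1)[1].strip()

    # Mechanism: the instruction logic -- numbered or bulleted steps in the body.
    mechanism = [l.strip() for l in body.splitlines()
                 if re.match(r"^\s*(\d+\.|[-*])\s", l)]

    # Style: everything else (headings, prose). Replaced, not transferred.
    style = [l for l in body.splitlines()
             if l.strip() and l.strip() not in set(mechanism)]
    return {"trigger": trigger, "mechanism": mechanism, "style": style}
```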

The "runtime constraints" check matters more than it sounds. A skill written for an environment with filesystem access doesn't port cleanly to one without it. A skill assuming a specific MCP server is available fails silently if that server isn't registered. Mapping to local constraints means validating tool availability, checking hook execution order, and confirming token budget headroom before the skill gets committed to the library.
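The pre-commit validation reduces to a checklist function. Hypothetical names throughout; the real check also covers hook execution order and MCP server registration, which this sketch skips.

```python
def validate_skill(required_tools, skill_tokens, available_tools, budget_tokens):
    """Pre-commit checks for a borrowed skill (illustrative sketch).
    Returns a list of problems; an empty list means safe to commit."""
    problems = []
    # Tool availability: a skill assuming a tool we don't have fails silently.
    missing = sorted(set(required_tools) - set(available_tools))
    if missing:
        problems.append(f"unavailable tools: {', '.join(missing)}")
    # Token budget headroom: the skill body gets injected into the prompt.
    if skill_tokens > budget_tokens:
        problems.append(f"skill needs {skill_tokens} tokens, "
                        f"budget headroom is {budget_tokens}")
    return problems
```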

Agent MRI

~/.openclaw/skills/openclaw-anatomy

When things go wrong, most people just poke random files and hope.

openclaw-anatomy is about understanding the system's internals so debugging becomes intentional:

  • what counts as runtime truth,
  • where config and behavior are actually resolved,
  • where version mismatch failure modes come from,
  • and where to inspect first.

If you understand your assistant's anatomy, you stop guessing and start diagnosing.

Under the hood: OpenClaw already has access to its own runtime state through tools — it can read live process state, config values, session logs, channel status. What openclaw-anatomy adds is a curated snapshot of its own matching source code as context alongside that runtime data.

This is the key design insight: source code as a diagnostic mirror. At any given moment, OpenClaw can compare "what the code says should happen" against "what is actually happening right now." It reads the relevant source paths — gateway startup, channel routing, config resolution, extension loading — and reasons about expected behavior from first principles, then cross-references against the live state it observes through its tools.

The pattern unlocks intentional self-surgery. Because it has the source as reference, it knows which config keys map to which code paths, which files govern which behaviors, and what a correct state looks like before it touches anything. Without the source snapshot, it would be inferring intent from behavior. With it, intent is explicit and the deviation is the diagnostic signal. The surgery becomes targeted: find the gap between what source says should be wired and what runtime shows is wired, then make the minimal change to close it.
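The core of that loop is just a diff between two views of the same system. A minimal sketch, with hypothetical config keys; the real probes read gateway, channel, and config state through the runtime tools.

```python
def diagnose(expected: dict, observed: dict) -> list[str]:
    """Compare what the source says should be wired against live runtime
    state. Each deviation is a diagnostic signal pointing at a minimal fix."""
    signals = []
    for key, want in expected.items():
        have = observed.get(key, "<absent>")
        if have != want:
            signals.append(f"{key}: source expects {want!r}, runtime shows {have!r}")
    return signals
```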

Noise Filter, Signal Hunter

~/.openclaw/skills/rss-news-analyzer

I upgraded this with better matching logic by leveraging existing vector memory search.

The practical effect is lower noise and higher relevance:

  • better candidate selection,
  • stronger matching against active interests,
  • and fewer "technically related but useless" items.

I think this matters more than people realize. Retrieval quality often becomes the hidden ceiling of agent quality.

Under the hood: The previous version used keyword matching — a feed item was "relevant" if it shared surface terms with tracked topics. That approach has a well-known failure mode: high recall, low precision. You get everything tangentially related, which still requires manual filtering.

The upgrade routes candidate selection through shared retrieval infrastructure. Feed items (title + description, sometimes full content) are embedded at fetch time and stored as vectors. "Active interests" — extracted from recent conversation history, a pinned interest profile, or both — are embedded at query time. Relevance becomes cosine similarity in embedding space rather than term overlap.

The practical difference: semantic matching handles paraphrase naturally. An item about "LLM inference latency" matches an interest in "making models faster" even without shared keywords. It also handles topic adjacency — a story about GPU memory constraints surfaces when you're researching context window limits, because the embedding space clusters related concepts even without explicit co-occurrence. The "technically related but useless" false positives drop because the embedding captures meaning, not just vocabulary.

The key implementation detail is that this analyzer doesn't need its own embedding pipeline. It piggybacks on the existing vector store infrastructure, which means the operational cost is just new writes at fetch time and a similarity query at match time — no new models, no new storage layer, fewer points of failure, less maintenance.
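The scoring flow can be sketched with a toy embedding. To be clear about the assumption: the real analyzer reuses the vector store's embedding model, and this bag-of-words stand-in only captures term overlap, so it won't reproduce the paraphrase matching described above — but the ranking machinery is the same shape.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts (toy; real pipeline uses the
    vector store's embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_items(items: list[str], interest: str, top_k: int = 3) -> list[str]:
    """Score feed items against an active interest; keep the best matches."""
    q = embed(interest)
    ranked = sorted(items, key=lambda it: cosine(embed(it), q), reverse=True)
    return ranked[:top_k]
```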

Tool Air-Traffic Control

~/Workspace/openclaw-mcp-router

This project was inspired by Anthropic's write-up on advanced tool use and the Tool Search Tool idea:

https://www.anthropic.com/engineering/advanced-tool-use

As tool ecosystems grow, dumping every tool definition into context becomes inefficient. The router approach lets me expose more MCP tools while keeping the active context lean.

So the assistant can discover and route to tools more dynamically, instead of carrying everything all the time.

And yes, this is now published on NPM:

https://www.npmjs.com/package/openclaw-mcp-router

That publication matters to me because it turns a local experiment into something reusable by others.

Under the hood: MCP (Model Context Protocol) uses JSON-RPC 2.0 over stdio or SSE transport. Each tool definition carries a name, a description, and a JSON Schema for inputSchema. The problem at scale: with 58 registered tools, you're burning roughly 77k tokens just on tool definitions before any actual context or conversation lands in the window. That's not theoretical overhead — that's the difference between fitting a full research session in context and getting cut off.

The router is a meta-MCP server that sits in front of N downstream MCP servers. It registers two meta-tools with the host: mcp_search(query) and mcp_call(tool_name, params_json). When the agent needs to do something, it calls mcp_search with a natural language description of what it wants. The router embeds that query, runs it against pre-embedded tool descriptions from all registered downstream servers, and returns the top-k matching tool definitions — schemas included — as the response. The agent then routes execution through mcp_call, which resolves the owning server from the registry, opens a fresh connection, executes the tool, and returns the result. Schemas are only loaded when needed, and execution stays fully proxied through the router.
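Stripped of transport, the two meta-tools reduce to a small class. This sketch swaps JSON-RPC-over-stdio for plain callables and real embeddings for toy keyword overlap, so only the control flow survives — it is not the published router's code.

```python
class McpRouter:
    """Minimal sketch of the meta-MCP router pattern: two meta-tools
    fronting N downstream tools."""

    def __init__(self):
        self.registry = {}  # tool_name -> (description, handler)

    def register(self, name, description, handler):
        """A downstream server's tool joins the registry (real router does
        this by introspecting each server's tool list at startup)."""
        self.registry[name] = (description, handler)

    def mcp_search(self, query: str, top_k: int = 3):
        """Return top-k tool definitions matching a natural-language query.
        Toy scoring: shared-word count instead of embedding similarity."""
        q = set(query.lower().split())
        def score(item):
            _, (desc, _) = item
            return len(q & set(desc.lower().split()))
        ranked = sorted(self.registry.items(), key=score, reverse=True)
        return [{"name": n, "description": d} for n, (d, _) in ranked[:top_k]]

    def mcp_call(self, tool_name: str, params: dict):
        """Resolve the owning handler from the registry and proxy the call."""
        _, handler = self.registry[tool_name]
        return handler(**params)
```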

This is the Tool Search Tool pattern from Anthropic's write-up, implemented as a composable proxy rather than a monolithic server. Publishing it to NPM means anyone can configure it as a stdio MCP server in .mcp.json and point it at their existing tool set — no code changes to the downstream servers, no modifications to the host client. The context footprint drops from ~77k tokens to ~8.7k tokens, a reduction of roughly 89%.
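For reference, a hypothetical .mcp.json entry wiring the router in as a stdio server. The package name is the one above; the "mcpServers" key follows the common MCP client convention, and the exact flags for registering downstream servers depend on the router's own config, which I'm not reproducing here.

```json
{
  "mcpServers": {
    "router": {
      "command": "npx",
      "args": ["-y", "openclaw-mcp-router"]
    }
  }
}
```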

The tradeoff worth naming: you add one round-trip per novel tool invocation. In practice this is negligible because most agentic sessions reuse a small set of tools for the bulk of their work. The context savings on a tool-heavy setup far outweigh the latency cost of one extra tool call.

The "Private" folder became idea infrastructure

One more piece completed the loop: my _private research folder is indexed directly as vectors in memory.

So my collected articles aren't dead bookmarks anymore. They become retrievable context the agent can reason over.

More importantly, this memory layer starts to function like a company knowledge base or a personal operating system for taste. It encodes domain knowledge, preferences, style, and long-term priorities — which sets the tone for how the agent should reason and act going forward.

That enables three useful behaviors:

  • pull relevant prior research at the right moment,
  • connect ideas across different sources and time windows,
  • and propose new ideas grounded in my own reading history.

Under the hood: The indexing pipeline runs a file watcher on _private/. On change, new or modified documents get chunked — roughly 2000-token windows with ~10% overlap to preserve context across chunk boundaries — then each chunk is embedded and upserted into the vector store. The overlap is deliberate: without it, a key sentence that falls at a chunk boundary might get semantically orphaned from the paragraph that gives it meaning.

At query time, the agent embeds the current question, runs a similarity search, and injects the top-k chunks as context. The chunking strategy is where most of the implementation nuance lives. Too large a chunk (say, full articles) dilutes the embedding signal because the vector must represent too many topics at once. Too small (individual sentences) loses surrounding context that disambiguates meaning. The 2000-token-with-overlap approach is a common starting point that works well for long-form research notes.
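The windowing itself is a few lines. A sketch over a pre-tokenized document, assuming real token counts come from the embedding model's tokenizer rather than the whitespace split used here for brevity:

```python
def chunk(tokens: list[str], size: int = 2000, overlap_frac: float = 0.10):
    """Sliding-window chunking with overlap. Each window starts
    size * (1 - overlap_frac) tokens after the previous one, so adjacent
    chunks share ~overlap_frac of their content across the boundary."""
    step = max(1, int(size * (1 - overlap_frac)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window reached the end; avoid duplicate tail chunks
    return chunks
```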

The "idea generation" behavior that makes this valuable is a direct consequence of retrieval scope. When you ask about topic X, the agent pulls not just your most recent note on X but semantically adjacent notes from months ago — things you read and filed away, connections you never consciously made. The synthesis happens at inference time over your own accumulated reading. That's qualitatively different from an agent answering from training data: it's reasoning over your specific intellectual history, not the average of the internet.

This is where I see the bigger transformation: not just answer generation, but idea generation from accumulated personal research.

What this says about the agentic trend

For me, "agentic" is not mainly about autonomous loops.

It's about systems that can:

  • learn from external patterns,
  • introspect and debug themselves faster,
  • improve relevance through memory-aware retrieval,
  • and expand tool access without context collapse.

The common thread across all four of these builds is the same architectural principle: keep the runtime lean, make capabilities composable, and let quality compound over time. Skills are lazily injected. Tools are dynamically discovered. Memory is retrieved by relevance, not loaded wholesale. Config resolution is explicit and inspectable.

That's the engineering discipline underneath "self-evolution." It's not magic. It's just treating your agent infrastructure with the same design care you'd give any production system.

In short: self-evolution with maintainability.

What I'm building next

The next phase is about two things: perception breadth and memory efficiency.

Perception: better inputs, better horizons

I plan to expand beyond RSS into curated Substack, Reddit, Medium, and other high-signal sources.

The design principle is simple: my human judgment acts as a value function. I filter what is worth keeping, and the agent reasons over that curated stream instead of raw information noise.

That gives the system better long-term signal and helps it "see" farther through a more intentional lens.

Memory: from archive to active substrate

As the corpus grows, retrieval quality and storage efficiency matter more than raw volume.

One direction I'm exploring is integrating ideas from MemOS.

If this works, memory becomes more than a passive archive. It becomes active infrastructure for planning, synthesis, and adaptation.

I stopped prompting and started building.


Comment section — Miko's take 🧠

The architecture is strong, but the hard part starts now: discipline.

Three risks to watch:

  1. Curation bias: if _private gets too narrow, I might become sharper but less diverse in perspective.
  2. Memory drift: as the corpus scales, retrieval quality can decay unless indexing and evaluation stay rigorous.
  3. Complexity creep: more tools and routing layers increase capability, but they also increase hidden fragility.

The direction is still right. But self-transformation only works if we keep measuring quality, pruning aggressively, and treating this as a production system instead of a demo.