
That tedious cycle of crafting great prompts, then losing them, then starting over? There's finally a better way. Claude Skills and subagents are changing how we build with AI—and the token savings alone might surprise you.
You craft the perfect prompt after 20 minutes of iteration. It works brilliantly. Three days later, you need that same behavior again, so you... start prompting from scratch, because who remembers where they saved that template? Welcome to what I call the prompt engineering hamster wheel—and it's a fundamentally broken workflow.
Claude Skills are Anthropic's answer to this reusable prompt problem, but they're much more than glorified prompt storage. They introduce a fundamentally different approach to context management, token economics, and AI workflow architecture that could save you hundreds of dollars monthly while dramatically improving response quality.
Every token in your AI's context window costs you in three compounding ways: direct spend on input tokens, added latency as the model processes a bloated prompt, and degraded response quality as relevant information competes with noise.
Let's put real numbers on this. A typical MCP setup for development work can easily carry around 32,000 tokens of tool metadata, loaded into every single message whether you use those tools or not. At Claude Opus 4.6's $5 per million input tokens, those idle MCP descriptions add $0.16 to every message. Send 50 messages daily over a 20-day work month? That's $160/month in pure overhead—before you account for the latency and quality impacts.
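The arithmetic above is easy to check. This small sketch uses the article's assumed figures (32,000 tokens of metadata, $5 per million input tokens, 50 messages a day, 20 working days):

```python
# Back-of-envelope cost of idle MCP tool metadata per message and per month,
# using the article's assumed figures.
MCP_METADATA_TOKENS = 32_000
PRICE_PER_MILLION_INPUT = 5.00   # USD, assumed model pricing
MESSAGES_PER_DAY = 50
WORK_DAYS_PER_MONTH = 20

cost_per_message = MCP_METADATA_TOKENS / 1_000_000 * PRICE_PER_MILLION_INPUT
monthly_overhead = cost_per_message * MESSAGES_PER_DAY * WORK_DAYS_PER_MONTH

print(f"${cost_per_message:.2f} per message")    # $0.16 per message
print(f"${monthly_overhead:.0f}/month overhead") # $160/month overhead
```

Swap in your own message volume and model pricing; the overhead scales linearly with both.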
The fundamental problem isn't the tools themselves—it's the eager loading pattern that dumps everything into context upfront, regardless of relevance.
At their core, Claude Skills are reusable instruction sets that AI agents can automatically access when relevant to a conversation. You write a skill.md file with metadata and instructions, drop it into a .claude/skills/ directory, and Claude handles the rest.
Skills use progressive disclosure across three distinct levels, each with its own context budget:
Metadata (loaded at startup): Skill name (max 64 characters) and description (max 1,024 characters). Costs roughly 100 tokens per skill—negligible overhead even with hundreds registered.
Skill body (loaded on invocation): The full instruction set inside skill.md, up to ~5,000 tokens. Only enters context when the agent determines the skill is relevant.
Referenced files (loaded on demand): Additional markdown files, folders, or scripts within the skill directory. No practical limit, loaded only when instructions reference them and the current task requires it.
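The three levels can be sketched as a toy loader. The class and its methods are hypothetical (Claude manages this internally); the character limits and token figures are the ones above:

```python
# Hypothetical sketch of progressive disclosure: only metadata is in context
# at startup; the body and referenced files enter context later, on demand.

class Skill:
    def __init__(self, name, description, body, references=None):
        self.name = name                    # max 64 characters
        self.description = description      # max 1,024 characters
        self.body = body                    # up to ~5,000 tokens
        self.references = references or {}  # extra files, loaded on demand

    def metadata(self):
        """Level 1: always in context (~100 tokens per skill)."""
        return {"name": self.name, "description": self.description}

    def invoke(self):
        """Level 2: the full body enters context only when relevant."""
        return self.body

    def read_reference(self, path):
        """Level 3: a referenced file, read only when the task needs it."""
        return self.references[path]

skill = Skill(
    name="code-reviewer",
    description="Reviews pull requests following team conventions",
    body="## Code Review Guidelines\n1. Check security first...",
    references={"docs/style.md": "Use 4-space indentation."},
)

startup_context = [skill.metadata()]  # the only part every message pays for
```

Note that `startup_context` never contains the body or the style guide; those costs are deferred until a conversation actually needs the skill.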
A basic skill looks like this:
```markdown
---
name: code-reviewer
description: Reviews pull requests following team conventions and security best practices
---

## Code Review Guidelines

When reviewing code:

1. Check for security vulnerabilities first
2. Verify adherence to team style guide in /docs/style.md
3. Ensure proper test coverage
4. Reference related Linear ticket in comments
```
Key insight: Skills are lazy-loaded context. The agent doesn't consume the full instruction set upfront—it progressively discloses information to itself, pulling in only what's needed for the current step.
The real power isn't just storage—it's automatic relevance detection. When you start a conversation, Claude scans skill metadata to understand what expertise is available. When it detects a skill might be relevant, it loads the full body. If that body references additional files, it reads those too, but only on demand.
This means you could register 300 skills and still consume fewer tokens at startup than a typical MCP setup. In practice, most conversations invoke one or two skills while the rest remain invisible to the context window.
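That claim checks out arithmetically, using the per-skill metadata cost from above against the earlier eager-loading example:

```python
TOKENS_PER_SKILL_METADATA = 100   # rough per-skill metadata cost from above
EAGER_MCP_TOKENS = 32_000         # the earlier eager-loading MCP example

skills_registered = 300
startup_tokens = skills_registered * TOKENS_PER_SKILL_METADATA

print(startup_tokens)                     # 30000
print(startup_tokens < EAGER_MCP_TOKENS)  # True
```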
Before we go further, let's clear up a common misconception. Skills aren't "better MCPs"—they're solving different problems:
MCP is an open standard that lets any LLM interact with external applications. Before MCP, connecting M models to N tools required M × N custom integrations. MCP collapses that to M + N: each model implements the protocol once, each tool exposes it once, and they all interoperate.
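For concreteness, here is the integration arithmetic with illustrative counts (five models, twenty tools are made-up numbers):

```python
models, tools = 5, 20

custom_integrations = models * tools   # pre-MCP: every model-tool pair needs glue code
mcp_implementations = models + tools   # with MCP: one protocol adapter per side

print(custom_integrations)  # 100
print(mcp_implementations)  # 25
```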
Skills are "glorified prompts" (in the best possible way). They give agents expertise on how to approach tasks, what conventions to follow, when to use which tools, and how to structure output.
Say you connect GitHub's MCP server to your agent. MCP gives the agent the ability to create pull requests, list issues, and search repositories. But it doesn't tell the agent:
That's what a GitHub workflow skill provides—the playbook for using the tools effectively.
The real insight: MCP's eager loading pattern is the problem, not MCP itself. Skills prove that lazy-loading works. So why can't MCP tool access be lazy-loaded too?
Subagents are specialized child agents with isolated context windows and dedicated tool sets. Two properties make them powerful:
A subagent starts with a clean context window, pre-loaded with its own system prompt and only assigned tools. Everything it reads, processes, and generates stays in its own context. The main agent only sees the final result.
Each subagent gets its own MCP servers and skills. The main agent doesn't know about (or pay for) tools it never directly uses.
Once a subagent finishes its task, its entire context gets discarded. Tool metadata, intermediate reasoning, API responses—all gone. Only the result flows back to the main agent.
Imagine a subagent that researches a library's API. It might run several searches, skim pages of documentation (much of it irrelevant), and chase a few dead ends before landing on the answer.
You pay for the subagent's token usage, but all that intermediate work—the dead ends, irrelevant pages, failed searches—gets discarded. None of it compounds into the main agent's context, so every subsequent message stays clean and cheap.
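A minimal sketch of the isolation pattern, with context windows modeled as plain lists. Everything here is hypothetical structure, not a real Claude API; the research steps are invented for illustration:

```python
# Hypothetical sketch: a subagent accumulates its own context, and only the
# final result survives; all intermediate tokens are discarded with it.

def run_research_subagent(task):
    # Fresh, isolated context: system prompt plus the assigned task only.
    context = ["system: you research library APIs", f"task: {task}"]
    # Intermediate work stays inside this context...
    context.append("search: pagination docs -> 3 irrelevant pages")
    context.append("read: api-reference.md -> found cursor-based pagination")
    result = "Use cursor-based pagination via the `after` parameter."
    return result  # the context dies with this function frame

main_context = ["user: how do I paginate this API?"]
summary = run_research_subagent("find the pagination scheme")
main_context.append(f"subagent result: {summary}")

# The main agent pays for the one-line summary, not the research trail.
```

The design choice to return a string rather than the subagent's transcript is the whole point: the main conversation never carries the dead ends.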
You can design setups where MCP servers are only accessible through specific subagents, never loaded on the main agent at all. Instead of carrying ~32,000 tokens of tool metadata in every message, the main agent carries nearly zero.
When it needs to open a pull request, it spawns a GitHub subagent with the GitHub MCP server and the team's GitHub workflow skill loaded.
The subagent does its work, returns a clean result, and disappears. The main conversation continues without any GitHub-related bloat.
Bottom line: Subagents let you have the MCP tool ecosystem without paying the context window tax on every message.
Here's how to start implementing this approach:
Look through your recent conversations and identify the prompts and instructions you find yourself rewriting again and again. Those repeat performances are your first skill candidates.
Start simple with a high-frequency use case:
```markdown
---
name: api-documentation
description: Creates comprehensive API documentation following OpenAPI standards
---

## Documentation Standards

When documenting APIs:

1. Use OpenAPI 3.0 format
2. Include example requests/responses
3. Document all error codes
4. Add rate limiting information
5. Reference authentication in /docs/auth-guide.md
```
Group related tools into specialized subagents. A GitHub subagent, for example, can own the GitHub MCP server and the workflow skills that go with it, so the main agent never loads either.
Track these metrics before and after: input tokens per message, monthly API spend, and response latency.
The prompt engineering hamster wheel exists because we've been thinking about AI interactions wrong—as isolated conversations instead of reusable systems. Claude Skills and subagents represent a fundamental shift toward treating AI as infrastructure: modular, efficient, and composable.
Skills solve the reusability problem through lazy-loaded expertise. Subagents solve the context bloat problem through isolated, specialized workers. Together, they can cut your token costs by 80%+ while dramatically improving response quality and development velocity. The real question isn't whether to adopt this architecture—it's how quickly you can migrate your current workflows to take advantage of it.