Why we replaced our RAG pipeline with Agent Skills

June 16, 2026

At Hebo we provide the tools to build high-quality conversational agents that businesses run in production. Each agent answers from a company-specific knowledge base: policies, product catalogs, pricing, processes, whatever its job requires.

For the past years, that knowledge layer was built on a classic RAG pipeline. We chunked the knowledge base, embedded it, retrieved the most similar chunks at query time, and injected them into the prompt. It worked well enough, but every time we touched that pipeline we were reminded of how many moving parts it had, and how much latency and complexity each of them added.

This post is about why we replaced it, how we validated the replacement, and what we measured.

Three models before every answer

A RAG pipeline may look simple on paper, but answering a single user message requires three sequential LLM calls before a reply can even start streaming:

RAG pipeline

                    ┌───────────────┐   ┌───────────────┐   ┌──────────────┐
User message ──────▶│  Rewrite LLM  │──▶│   Embedding   │──▶│  Generation  │──▶ Reply
                    │  (standalone  │   │    model +    │   │     LLM      │
                    │   question)   │   │ vector search │   │              │
                    └───────────────┘   └───────────────┘   └──────────────┘

A rewrite LLM step rephrases the conversation history plus the latest message into a standalone question (retrieval works better on single queries than multi-turn chats).
An embedding model converts that question into a vector, and we search the knowledge base for the most similar chunks.
The generation LLM finally answers, with the retrieved chunks injected into the prompt.

If each LLM call takes 2–5 seconds to complete, that's 4–10 seconds of overhead before the generation model can start streaming. This happens on every turn, including greetings and follow-up questions where the answer is already in the conversation and no retrieval is needed. The pipeline also injects retrieved chunks into the prompt regardless of whether they're relevant. In our case that meant prompts averaging ~67k tokens, a large portion of which the generation model never used. And if the retrieval misses the correct chunk, the generation model can't recover, because it does not know what it can't see.

Let the agent decide what it needs

Agent Skills are a technique (originally developed by Anthropic) for packaging knowledge and instructions into files that an agent loads incrementally on demand. In practice, a skill is a markdown file with a name, a description, and content.

The idea is based on progressive disclosure. The agent's system prompt contains only a lightweight index with the name and a one-line description of each skill, and when the conversation calls for it, the agent reads the full skill via a tool call.

That simplifies the whole pipeline into a single model with one tool:

Skills pipeline

                    ┌───────────────────────────────┐
User message ──────▶│        Agent LLM              │──▶ Reply
                    │   (skill index in prompt,     │
                    │ `read` the skill when needed) │
                    └───────────────────────────────┘

The retrieval step is now inside the model, so instead of us guessing what context it needs via cosine similarity, the model decides for itself whether it needs more context and which skill to load.

A RAG pipeline is infrastructure: embeddings to keep in sync, chunking strategies to tune, retrieval parameters to watch. A skills setup is a folder of markdown files and one tool definition.

From chunks to hierarchy

Because the model navigates deliberately rather than relying on similarity matching, you can also structure content differently. In our RAG setup, related content had to be split into 2–3 hand-crafted chunks. A topic might have a generic introduction in one chunk and two orthogonal scenarios in separate chunks, each describing how to handle a different case. Whether the agent saw all three depended on whether cosine similarity surfaced them independently from the user's message.

With skills, that same content lives under a single parent skill. The introduction is there, and hyperlinks point to sub-skills covering each scenario. These sub-skills are not listed in the top-level index. The agent only discovers them when it loads the parent and decides it needs to go deeper. Related content stays connected, and the agent navigates the hierarchy rather than hoping a similarity search retrieves each piece independently.

The reality check

A simpler architecture is worth nothing if reply quality drops, and these agents talk to real customers on behalf of real businesses. We can't accept a wrong price or a hallucinated store policy in exchange for fewer moving parts.

So we ran a blind A/B experiment. Both arms shared the same knowledge base content, the same external tools, and the same model. Human evaluators ran realistic e-commerce scenarios (product inquiries, order questions, returns, promotions) without knowing which architecture they were chatting with, and scored each conversation on correctness, policy alignment, helpfulness, and communication, plus a separate holistic overall score, all on a 1-5 scale.

Quality

Across the blind-evaluated sessions:

Dimension	RAG	Skills
Correctness	4.77	4.27
Policy alignment	4.42	4.21
Helpfulness	4.61	4.24
Communication	4.48	3.91
Overall	4.29	3.94

RAG kept an edge of 0.35 points overall. One asymmetry worth noting is that on Hebo, RAG knowledge bases go through several human-curated iterations and are tuned in production for months, while the skills versions were AI-assisted rewrites of that same content, iterated only a couple of times. RAG entered the experiment as the incumbent in the best possible shape, so the gap is roughly what you'd expect. What mattered to us was skills landing in the same quality band as RAG rather than falling behind.

Splitting the experiment into its early and late phase, the skills overall score climbed from 3.77 to 4.05 as we iterated on the skill content, with every dimension improving. The gap was closing with ordinary content work, not architectural changes.

Speed and tokens

Running both arms on identical scenarios:

Per turn	RAG	Skills
Median latency	15.0s	11.8s
Mean latency	16.7s	13.4s
Avg prompt tokens	~67k	~31k

Skills replied about 20% faster. On tokens, we went through an iteration: our first skills versions actually used more than RAG (around 90k per turn) because the indexes were poorly organized and the agents were loading too many skills. Restructuring them with clearer names, sharper descriptions, and better-scoped files brought consumption down to the ~31k you see in the table, less than half of RAG's footprint.

With skills, the index is the retrieval system, and it deserves the same care you'd put into a chunking strategy.

Where skills stand today

Skills fit well here because these agents handle support and outbound sales over messaging, and their knowledge bases are well-structured KBs of 100–250 pages each, covering focused topics. What used to be a RAG chunk is now a skill. At that scale, the entire skill index fits comfortably in the system prompt, and the agents reliably pick the right skill to load.

After another month of content optimizations following the experiment, skills now score 4.53 overall, compared to the RAG baseline's 4.29, beating it on every single dimension:

Dimension	RAG	Skills (latest)
Correctness	4.77	4.40
Policy alignment	4.42	4.67
Helpfulness	4.61	4.47
Communication	4.48	4.67
Overall	4.29	4.53

We ended up with higher quality, lower latency, half the tokens, and one model in the critical path instead of three. The quality ceiling turned out to depend on how good the content is, not on which retrieval mechanism serves it.

RAG still has its place. If your knowledge base has thousands of documents, listing a title and description for each one stops being an index and starts drowning the model in metadata before the conversation even begins. At that scale, vector retrieval over content the model could never enumerate is exactly the right tool. For well-structured knowledge bases of 100–250 pages, RAG was more machinery than the problem required.

What's next

If you're building conversational agents on a well-structured knowledge base of 100–250 pages, try skills before you reach for a vector database. You might find, as we did, that the simplest architecture that could possibly work… works.