

Why we replaced our RAG pipeline with Agent Skills

How a radically simpler architecture matched a hand-tuned RAG pipeline while cutting latency and tokens along the way

June 12, 2026

At Hebo we provide the tools to build high-quality conversational agents that businesses run in production. Each agent answers from a company-specific knowledge base: policies, product catalogs, pricing, processes, whatever its job requires.

For the past few years, that knowledge layer was built on a classic RAG pipeline. We chunked the knowledge base, embedded it, retrieved the most similar chunks at query time, and injected them into the prompt. It worked well enough, but every time we touched that pipeline we were reminded of how many moving parts it had, and how much latency and complexity each of them added.

So we decided to try replacing it with Agent Skills.

This post is about why we made the switch, how we validated it, and what we measured along the way.

Three models before every answer

A RAG pipeline may look simple on paper, but in ours, answering a single user message required three sequential model calls before a reply could even start streaming:

RAG pipeline
                    ┌───────────────┐   ┌───────────────┐   ┌──────────────┐
User message ──────▶│  Rewrite LLM  │──▶│   Embedding   │──▶│  Generation  │──▶ Reply
                    │  (standalone  │   │    model +    │   │     LLM      │
                    │    question)  │   │ vector search │   │              │
                    └───────────────┘   └───────────────┘   └──────────────┘
  1. A rewrite LLM rephrases the conversation history plus the latest message into a standalone question (retrieval works better on single queries than multi-turn threads).
  2. An embedding model converts that question into a vector, and we search the knowledge base for the most similar chunks.
  3. The generation LLM finally answers, with the retrieved chunks injected into the prompt.

Each call has to finish before the next one can start, so their latencies add up, and each one is a separate point of failure. And if retrieval misses the right chunk, the generation model can't recover, because it has no way of knowing what it was never shown.
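The three stages listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our production code: `rewrite`, `embed`, and `generate` are hypothetical callables standing in for the real model calls, and a brute-force in-memory cosine search stands in for a vector database.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_with_rag(
    history: list[str],
    message: str,
    chunks: list[tuple[str, Vector]],            # (chunk text, precomputed embedding)
    rewrite: Callable[[list[str], str], str],    # stage 1: hypothetical rewrite LLM
    embed: Callable[[str], Vector],              # stage 2: hypothetical embedding model
    generate: Callable[[str, list[str]], str],   # stage 3: hypothetical generation LLM
    top_k: int = 3,
) -> str:
    # Stage 1: collapse the conversation into a standalone question.
    question = rewrite(history, message)
    # Stage 2: embed the question and rank chunks by similarity.
    query_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    context = [text for text, _ in ranked[:top_k]]
    # Stage 3: the generation model only ever sees the top_k chunks.
    return generate(question, context)
```

The stages run strictly in sequence, and whatever falls outside `top_k` is invisible to the final call, which is the failure mode described above.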

Let the agent decide what it needs

Agent Skills are an open format (originally developed by Anthropic) for packaging knowledge and instructions into folders that an agent loads on demand. A skill is just a markdown file with a name, a description, and content.
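To make that concrete, here is a small sketch of what a skill file can look like and how its name and description might be pulled out. The `returns-policy` skill and its text are invented for illustration, and the parser is deliberately naive (flat `key: value` lines between `---` markers); a real loader would more likely use a YAML parser.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    body: str


def parse_skill(text: str) -> Skill:
    """Parse a skill file: a '---'-delimited header of key: value lines, then markdown."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("skill file must start with a '---' header")
    try:
        # Index of the line that closes the header.
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        raise ValueError("unterminated header") from None
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return Skill(meta["name"], meta["description"], body)


# Hypothetical skill file, invented for this example.
EXAMPLE = """\
---
name: returns-policy
description: Return windows, refund methods, and exceptions.
---
# Returns

(hypothetical content) Items can be returned within the window stated here.
"""

skill = parse_skill(EXAMPLE)
print(skill.name, "|", skill.description)
```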

The idea is based on progressive disclosure. The agent's system prompt contains only a lightweight index with the name and a one-line description of each skill, and when the conversation calls for it, the agent reads the full skill via a tool call.

That simplifies the whole pipeline into a single model with one tool:

Skills pipeline
                    ┌───────────────────────────────┐
User message ──────▶│        Agent LLM              │──▶ Reply
                    │   (skill index in prompt,     │
                    │ `read` the skill when needed) │
                    └───────────────────────────────┘

The retrieval step is now inside the model. Instead of us guessing what context the model needs via cosine similarity, the model decides for itself whether it needs more context and which skill to load.
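A rough sketch of that setup, assuming skills are held in a plain dictionary: the system prompt carries only names and descriptions, and one tool returns a skill's full content on request. The function names (`build_index`, `read_skill`, `build_system_prompt`), the prompt wording, and the sample skills are all hypothetical; wiring the tool into a specific model API is left out.

```python
# Maps skill name -> (one-line description, full markdown body). Sample data is invented.
SKILLS: dict[str, tuple[str, str]] = {
    "returns-policy": ("Return windows, refund methods, and exceptions.",
                       "# Returns\n(full policy text)"),
    "shipping": ("Carriers, delivery times, and shipping fees.",
                 "# Shipping\n(full shipping text)"),
}


def build_index(skills: dict[str, tuple[str, str]]) -> str:
    """One line per skill: name and description only, never the body."""
    return "\n".join(f"- {name}: {desc}" for name, (desc, _) in sorted(skills.items()))


def read_skill(skills: dict[str, tuple[str, str]], name: str) -> str:
    """The single tool exposed to the agent: return a skill's full content."""
    if name not in skills:
        # Return a readable error so the model can correct itself and retry.
        return f"Unknown skill '{name}'. Available: {', '.join(sorted(skills))}"
    return skills[name][1]


def build_system_prompt(skills: dict[str, tuple[str, str]]) -> str:
    return (
        "Answer using the company's knowledge. Available skills:\n"
        f"{build_index(skills)}\n"
        "Call read_skill(name) when you need a skill's full content."
    )


print(build_system_prompt(SKILLS))
```

Only the index costs tokens on every turn; a skill's body enters the context when the model asks for it.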

Operationally the difference is large. A RAG pipeline is infrastructure: embeddings to keep in sync, chunking strategies to tune, retrieval parameters to watch. A skills setup is a folder of markdown files and one tool definition.

The reality check

Simpler architecture means nothing if reply quality drops. These agents talk to real customers on behalf of real businesses, where a wrong price or a hallucinated store policy is not an acceptable trade for fewer moving parts.

So we ran a blind A/B experiment. Both arms shared the same knowledge base content, the same external tools, and the same model. Human evaluators ran realistic e-commerce scenarios (product inquiries, order questions, returns, promotions) without knowing which architecture they were talking to, and scored each conversation on correctness, policy alignment, helpfulness, and communication, plus a separate holistic overall score, all on a 1-5 scale.

There's one asymmetry worth being upfront about. The RAG knowledge base had been human-curated and tuned in production for months, while the skills version was an AI-assisted rewrite of that same content, iterated only a couple of times. RAG entered the experiment as the incumbent in the best possible shape.

Quality

Across the blind-evaluated sessions:

| Dimension        | RAG  | Skills |
|------------------|------|--------|
| Correctness      | 4.77 | 4.27   |
| Policy alignment | 4.42 | 4.21   |
| Helpfulness      | 4.61 | 4.24   |
| Communication    | 4.48 | 3.91   |
| Overall          | 4.29 | 3.94   |

RAG kept an edge of 0.35 points overall at the time of the experiment. Given the curation asymmetry, that's roughly what you'd expect, and the result still cleared the bar we had set going in: skills had to land in the same quality band as RAG, not beat it.

There is more to the story. When we split the experiment into an early and a late phase, the skills overall score climbed from 3.77 to 4.05 as we iterated on the skill content, with every dimension improving. The gap was closing with ordinary content work, not architectural changes. After another month of optimizing the skill content, skills now score higher than RAG did, which confirmed what we suspected: the quality ceiling depends on how good the content is, not on which retrieval mechanism serves it.

Speed and tokens

Running both arms on identical scenarios:

| Per turn          | RAG   | Skills |
|-------------------|-------|--------|
| Median latency    | 15.0s | 11.8s  |
| Mean latency      | 16.7s | 13.4s  |
| Avg prompt tokens | ~67k  | ~31k   |

Skills replied about 20% faster while consuming less than half the prompt tokens, and that's on text-only scenarios. Conversations involving media skip an entire model call in the skills arm, so the gap there is likely wider.
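For readers who want to check the arithmetic, the percentages follow directly from the per-turn figures. The token counts are the rounded values reported above, so the ratio is approximate.

```python
def reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


median_cut = reduction(15.0, 11.8)   # median latency, seconds
mean_cut = reduction(16.7, 13.4)     # mean latency, seconds
token_ratio = 31_000 / 67_000        # skills prompt tokens relative to RAG (rounded inputs)

print(f"median latency cut: {median_cut:.1%}")    # 21.3%
print(f"mean latency cut:   {mean_cut:.1%}")      # 19.8%
print(f"prompt token ratio: {token_ratio:.1%}")   # 46.3%
```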

There's an honest footnote to these numbers. Our first skills version actually used more tokens than RAG (around 90k per turn) because the index was poorly organized and the agent was loading too many skills. Restructuring it with clearer names, sharper descriptions, and better-scoped files brought consumption down to less than half of RAG's footprint. With skills, the index is the retrieval system, and it deserves the same care you'd put into a chunking strategy.

Why we migrated

We ended up with quality that now exceeds the hand-tuned RAG baseline, lower latency, half the tokens, and one model in the critical path instead of three.

Skills fit this use case particularly well. The customer we ran the experiment for uses Hebo for support and outbound sales over messaging, and their knowledge base is compact: fewer than 50 sections, each covering a focused topic. What used to be a RAG chunk is now a skill. At that scale, the entire skill index fits comfortably in the system prompt, and the agent reliably picks the right skill to load.

RAG still has its place. If your knowledge base has thousands of documents, listing a title and description for each one stops being an index and starts drowning the model in metadata before the conversation even begins. At that scale, vector retrieval over content the model could never enumerate is exactly the right tool. For compact, well-structured knowledge bases, RAG was more machinery than the problem required.

What's next

If you're building conversational agents on a knowledge base that fits in a few dozen well-named files, try skills before you reach for a vector database. You might find, as we did, that the simplest architecture that could possibly work… works.
