Claude RAG Systems

RAG pipelines that make Claude answer from your knowledge, not its training

Retrieval-Augmented Generation lets Claude answer questions from your specific knowledge base rather than generic training data. The architecture is well-understood. The execution. chunking strategy, embedding model selection, retrieval scoring, hybrid search configuration, and citation implementation. determines whether your RAG system produces trustworthy, grounded answers or plausible-sounding hallucinations. We built and run the seo_query_kb RAG system that powers our fine-tuned Qwen3.5-27B model, delivering 100% citation rate across all knowledge base queries.

100%
Citation rate on seo_query_kb queries
5-6
Tok/s on our Qwen3.5-27B fine-tuned model
4
Knowledge sources indexed in our RAG system
0
Uncited claims in KB-grounded answers

What We Build

Six components of a production-grade Claude RAG system.

A RAG system that works in a demo and a RAG system that works reliably in production require different decisions at every layer.

Vector Database Architecture

Supabase pgvector is our primary recommendation for teams already on Postgres. It puts vector search inside your existing database, eliminating a separate vector store with its own auth, scaling, and operational overhead. We design the schema: the documents table, the embeddings column, the HNSW or IVFFlat index configuration, and the similarity search functions. For teams with specific requirements, we also build on Pinecone, Weaviate, or Chroma.

  • Supabase pgvector schema design
  • HNSW index configuration for performance
  • Similarity search function optimization
  • Multi-tenant document isolation

Document Processing Pipeline

How you chunk your documents determines retrieval quality more than any other factor. Chunks that are too large retrieve too much context. Chunks that are too small lose the semantic coherence that makes retrieval work. We design chunking strategies specific to your document type: markdown documents, PDFs, structured data, code files, and web content each have different optimal chunking patterns.

  • Document-type-specific chunking strategies
  • Semantic coherence preservation
  • Metadata extraction and indexing
  • Incremental update pipeline design

Hybrid Search Implementation

Pure vector search misses exact keyword matches. Pure keyword search misses semantic equivalents. Hybrid search combines both. a weighted combination of cosine similarity (vector) and BM25 or Postgres full-text search (keyword). The right weighting depends on your query patterns. We implement hybrid search, tune the weighting against your real queries, and measure the retrieval quality improvement over single-mode search.

  • Cosine similarity + Postgres full-text combination
  • Weight tuning against representative queries
  • Retrieval quality measurement
  • Query-time relevance scoring

Citation Integration

A RAG system without citations is a hallucination risk. When Claude cannot see what it is citing, it sometimes fills gaps with training data instead of retrieved content. We implement Claude's native citations feature alongside our RAG retrieval layer: Claude receives retrieved chunks with source metadata and is instructed to cite every factual claim. Our seo_query_kb system achieves 100% citation rate on all knowledge base queries using this pattern.

  • Anthropic citations API integration
  • Source metadata preservation through chunking
  • Citation formatting for different output types
  • 100% citation enforcement via system prompt

Access Control and Document Governance

Multi-tenant RAG systems need to ensure that queries from one user or team only retrieve documents they are authorized to access. We design the access control layer: document-level permission tags, query-time permission filtering, and audit logging for every retrieval operation. For regulated industries, we add data residency controls and retrieval audit trails.

  • Document-level permission tagging
  • Query-time permission filtering
  • Retrieval audit logging
  • Data residency compliance controls

Retrieval Quality Evaluation

Retrieval quality is measurable. For a given test question, did the system retrieve the chunk that actually contains the answer? We build retrieval eval frameworks: a ground-truth set of questions with known source documents, precision and recall measurement against that ground truth, and a monitoring pipeline that runs the eval as your document corpus evolves. You should know your retrieval quality number, not just trust that it works.

  • Ground-truth retrieval test set design
  • Precision and recall measurement
  • Retrieval quality dashboarding
  • Ongoing monitoring as corpus changes

How We Build RAG Systems

Four stages from document audit to citation-grounded production system.

Step 01

Knowledge base audit and architecture design

We review your document corpus: volume, document types, update frequency, and query patterns. This drives every design decision: chunking strategy, embedding model choice, index configuration, and whether hybrid search adds enough value for your query distribution to justify the additional complexity.

Step 02

Chunking and embedding pipeline build

We build the document processing pipeline: ingestion, chunking, embedding generation, metadata extraction, and index insertion. For document types you already have, we run the first full ingestion and validate chunk quality by reviewing the retrieved results on a sample of representative queries.

Step 03

Retrieval optimization and hybrid search tuning

We build the retrieval layer, implement hybrid search if the query distribution warrants it, and tune the weighting against your representative queries. Retrieval quality is measured against a ground-truth set before we connect Claude to the system.

Step 04

Citation integration and production deployment

We connect Claude to the retrieval layer, implement citations, and validate the end-to-end system on your actual queries. The deployment includes monitoring for retrieval quality, citation rate, and response latency. Documentation covers how to add new documents, modify the chunking strategy, and debug retrieval failures.

seo_query_kb in Production

How we built a RAG system that achieves 100% citation rate on 15 daily client workflows.

The Challenge

Our team needs to answer SEO questions that are grounded in specific, citable research: Ahrefs data studies, Moz methodology documentation, Google Search Central guidance, Search Engine Journal analysis. Generic Claude responses, while often correct, lack the citation accountability that professional SEO advice requires. We needed a system where every factual claim pointed to a specific source.

Our Solution

We built seo_query_kb: a RAG system with Supabase pgvector backing a corpus from Ahrefs, Backlinko, Moz, Search Engine Journal, and Google Search Central. The system uses hybrid search (vector similarity plus keyword matching) to retrieve the most relevant chunks, feeds them to our fine-tuned Qwen3.5-27B model running at 5-6 tok/s on an M4 Pro, and enforces citation via Anthropic's citations API plus a system prompt that blocks any uncited factual claim. The system is exposed via an MCP tool called seo_query_kb, accessible to Claude in every client session.

Results Achieved

100%
Citation rate on all queries
Enforced at system prompt level
4
Knowledge sources indexed
Ahrefs, Backlinko, Moz, Google SC
5-6 tok/s
Inference speed on fine-tuned model
Qwen3.5-27B on M4 Pro
Every session
Client sessions using seo_query_kb
Across all 15 clients

Our RAG Stack

The exact stack powering seo_query_kb with 100% citation rate.

We do not recommend tools we have not used in production. This is the stack we run.

Data Layer

  • Supabase pgvector for vector storage
  • Postgres full-text for keyword search
  • HNSW index for approximate nearest neighbor
  • Document metadata in structured columns

Processing Layer

  • Document chunking by semantic unit
  • Embedding generation with text-embedding models
  • Metadata extraction and preservation
  • Incremental index update on new documents

Retrieval Layer

  • Hybrid search: vector + keyword combination
  • Query-time permission filtering
  • Relevance re-ranking before context injection
  • Context window budget management

Generation Layer

  • Claude with retrieved chunks in context
  • Anthropic citations API for source attribution
  • System prompt enforcing citation requirement
  • Confidence signaling on low-retrieval queries

FAQ

Claude RAG system frequently asked questions

RAG and fine-tuning solve different problems. Fine-tuning improves how a model reasons and responds. RAG gives the model access to specific, current information it was not trained on. For most business knowledge base use cases, RAG is the right choice: your documents change, your model weights should not need to change every time they do. We actually use both in our stack. a fine-tuned model for reasoning style, RAG for knowledge retrieval.
Supabase pgvector puts vector search in the same database as your other application data. One auth system, one operational footprint, one query interface. For most teams, this simplicity outweighs the marginal performance advantages of dedicated vector databases at their scale. If you are searching hundreds of millions of vectors with sub-10ms latency requirements, the calculus changes. Most business knowledge bases are not in that category.
Hybrid search combines vector similarity search (which finds semantically similar content) with keyword search (which finds exact term matches). Vector search alone misses exact terms: a query for "BM25" may not retrieve documents with a high vector similarity score because the term is specific. Keyword search alone misses semantic equivalents: "click-through rate" and "CTR" would not match. Hybrid search covers both. It adds complexity but meaningfully improves retrieval quality for knowledge bases with specific terminology.
Three mechanisms working together: first, the system prompt instructs Claude to base every factual claim on retrieved content and cite the source. Second, we implement Anthropic's citations API, which ties Claude's output to specific retrieved chunks. Third, the system prompt instructs Claude to express uncertainty or decline to answer when retrieved content does not cover a query, rather than filling the gap with training data. The combination produces 100% citation rate in our own system.
We build an incremental update pipeline: new documents are split into chunks, converted to embeddings, and written to the index without requiring a full re-index of the existing corpus. For frequently updated documents, we implement a document version system that soft-deletes old chunks when new versions are ingested. You can add documents to your knowledge base without downtime or complex maintenance procedures.

Ready to Build Your Knowledge Base?

Let's design your RAG system.

Document audit, architecture design, retrieval pipeline build, and citation integration. Your knowledge base, accessible to Claude, with 100% grounded answers.

  • Document corpus audit and chunking strategy included
  • Hybrid search tuned to your query patterns
  • Citation integration for grounded, auditable outputs