RAG Systems with Claude: From Documentation to Production
Build production-grade RAG systems using Claude and vector search. A step-by-step guide to document retrieval, embedding, and cost optimization.
The problem: how do you give Claude your company's knowledge?
Claude has a 200K token context window — it can hold an entire book. But what if you need to:
- Answer questions about docs that change monthly
- Search across thousands of documents efficiently
- Stay up-to-date without retraining
- Control costs (processing 10MB of text every request is expensive)
A naive approach: throw everything into the prompt. This fails because:
- You can't practically include all documents
- Costs explode as documents grow
- Irrelevant context confuses the model
- Updates require new deployments
This is the RAG problem: retrieval-augmented generation.
The solution: retrieve relevant context, then generate
RAG works in two steps:
1. RETRIEVAL: User asks a question
→ Search your documents for relevant context
→ Return top-5 most similar passages
2. GENERATION: Feed Claude the question + retrieved context
→ Claude answers based on your documents
→ Returns answer with citationsThis is powerful because:
- Only relevant documents are processed (low cost)
- Your docs can be updated independently
- Answers are grounded in your knowledge
- Fully auditable (you see which docs were used)
How Techcologic builds RAG systems
We use a three-layer architecture.
Layer 1: Embedding & vector search
Step 1: Chunk your documents into passages (500–1000 tokens each).
Document: "Claude API Overview.pdf" (200 pages)
→
Chunks: [
"Claude is a large language model trained by Anthropic...",
"To use Claude, you need an API key from...",
"The Claude model family includes Opus, Sonnet, and Haiku...",
... (200+ chunks)
]Step 2: Convert chunks to embeddings.
Embedding Service: text-embedding-3-small (or another embedding model)
Chunk: "Claude is a large language model..."
→
Vector: [0.123, -0.456, 0.789, ..., 0.234] (1536 dimensions)Step 3: Store in a vector database.
Database: pgvector (PostgreSQL + vector extension)
OR Pinecone, Weaviate, Milvus (cloud)
Table: documents
├─ id: chunk_id
├─ text: "Claude is a..."
├─ vector: [embeddings]
├─ source: "Claude API Overview.pdf"
└─ updated_at: 2024-06-15Layer 2: Retrieval on query
When a user asks a question:
# 1. Embed the user's question
user_question = "How do I use Claude with streaming?"
query_vector = embed_model.embed(user_question)
# 2. Find similar documents in your database
similar = vector_db.search(
query_vector,
top_k=5,
min_similarity=0.7
)
# 3. Result: Top-5 passages from your docs
retrieved = [
{
"text": "Claude supports streaming via server-sent events...",
"source": "API Guide.pdf",
"similarity": 0.94
},
... (4 more)
]Layer 3: Generation with Claude
# Construct the augmented prompt
prompt = f"""
Use the following context from Techcologic documentation:
{retrieved_context}
User question: {user_question}
Answer the question using ONLY the context above.
If the answer isn't in the context, say: "I don't have information on this."
Include citations: (Source: document_name)
"""
# Call Claude with your knowledge
response = claude.message(prompt, max_tokens=500)Real example: internal knowledge base
Scenario: Techcologic's 50-page engineering handbook, constantly updated.
Without RAG:
- Include the entire handbook in every prompt (150K tokens)
- Cost: $2.25 per query (expensive!)
- Fails when the handbook exceeds the context window
With RAG (the Techcologic approach):
- Store handbook chunks in a vector database
- Retrieve only relevant sections per query (2–5K tokens)
- Cost: $0.03 per query (75× cheaper!)
- Handbook can grow unlimited
Comparison:
| Approach | Cost per query | Latency | Scalability | Updates |
|---|---|---|---|---|
| Naive (full context) | $2–5 | 5–10s | Limited to token window | Requires redeploy |
| RAG with pgvector | $0.02–0.05 | 1–2s | Unlimited docs | Instant |
| RAG + caching | $0.005–0.01 | <500ms | Unlimited docs | Instant |
Building RAG step by step
Step 1: Prepare documents
1. Collect your documents (PDFs, Markdown, text)
2. Extract text (PyPDF2, pdfplumber for PDFs)
3. Chunk into 500-1000 token pieces
4. Store in a database with metadataStep 2: Set up the vector database
Option A: PostgreSQL + pgvector (self-hosted)
Option B: Pinecone (serverless)
Option C: Weaviate (open-source)
We recommend pgvector for most teams — it's cheap, reliable, debuggable.Step 3: Embed & index
from anthropic import Anthropic
# Embed each document chunk
embeddings = model.embed(chunks)
# Store in the vector DB
vector_db.insert(chunks, embeddings, metadata)Step 4: Build the retrieval function
def retrieve_context(question: str, top_k: int = 5):
query_vector = embed_model.embed(question)
results = vector_db.search(query_vector, top_k)
return [r.text for r in results]Step 5: Create the answer function
def answer_question(question: str):
context = retrieve_context(question)
prompt = f"""Context: {context}
Question: {question}
Answer:"""
response = claude.message(prompt, max_tokens=500)
return responseCommon pitfalls (and how to avoid them)
| Problem | Cause | Solution |
|---|---|---|
| Low-quality answers | Irrelevant documents retrieved | Improve chunking strategy, increase similarity threshold |
| High costs | Too many tokens sent to Claude | Optimize chunk size, retrieve fewer docs, use caching |
| Stale answers | Documents never updated | Set up automated sync, monitor freshness |
| Hallucination | Model invents info not in docs | Use a system prompt: "Only answer from provided context" |
Techcologic's RAG stack
For production systems, we use:
Documents → Chunking (LangChain)
→ Embedding (text-embedding-3-small)
→ Storage (pgvector on RDS)
→ Retrieval (vector similarity search)
→ Generation (Claude API)
→ Monitoring (LangSmith, custom logging)Result: production RAG systems that handle millions of queries, stay accurate, and cost under $0.02 per question.
Getting started today
If you're building with Claude and need to ground answers in your documents:
- Start small — Pick 5–10 important docs
- Chunk them — 500-token pieces
- Embed them — Use a dedicated embedding model
- Store them — PostgreSQL + pgvector (free tier available)
- Test retrieval — Verify top-5 results make sense
- Add Claude — Build the augmented prompt
- Monitor — Track retrieval quality and token usage
This takes a weekend to prototype and a few days to production.
Ready to ship RAG? Book a Claude architecture call with Techcologic.
Key takeaways
- RAG lets you augment Claude with your documents
- Vector search finds relevant context in milliseconds
- Costs drop 10–100× vs. naive approaches
- Production RAG systems are reliable and maintainable