June 16, 20269 min read

RAG Systems with Claude: From Documentation to Production

Build production-grade RAG systems using Claude and vector search. A step-by-step guide to document retrieval, embedding, and cost optimization.

RAGVector SearchClaude API

The problem: how do you give Claude your company's knowledge?

Claude has a 200K token context window — it can hold an entire book. But what if you need to:

Answer questions about docs that change monthly
Search across thousands of documents efficiently
Stay up-to-date without retraining
Control costs (processing 10MB of text every request is expensive)

A naive approach: throw everything into the prompt. This fails because:

You can't practically include all documents
Costs explode as documents grow
Irrelevant context confuses the model
Updates require new deployments

This is the RAG problem: retrieval-augmented generation.

The solution: retrieve relevant context, then generate

RAG works in two steps:

text

1. RETRIEVAL: User asks a question
   → Search your documents for relevant context
   → Return top-5 most similar passages

2. GENERATION: Feed Claude the question + retrieved context
   → Claude answers based on your documents
   → Returns answer with citations

This is powerful because:

Only relevant documents are processed (low cost)
Your docs can be updated independently
Answers are grounded in your knowledge
Fully auditable (you see which docs were used)

How Techcologic builds RAG systems

We use a three-layer architecture.

Layer 1: Embedding & vector search

Step 1: Chunk your documents into passages (500–1000 tokens each).

text

Document: "Claude API Overview.pdf" (200 pages)
→
Chunks: [
  "Claude is a large language model trained by Anthropic...",
  "To use Claude, you need an API key from...",
  "The Claude model family includes Opus, Sonnet, and Haiku...",
  ... (200+ chunks)
]

Step 2: Convert chunks to embeddings.

text

Embedding Service: text-embedding-3-small (or another embedding model)
Chunk: "Claude is a large language model..."
→
Vector: [0.123, -0.456, 0.789, ..., 0.234] (1536 dimensions)

Step 3: Store in a vector database.

text

Database: pgvector (PostgreSQL + vector extension)
        OR Pinecone, Weaviate, Milvus (cloud)

Table: documents
├─ id: chunk_id
├─ text: "Claude is a..."
├─ vector: [embeddings]
├─ source: "Claude API Overview.pdf"
└─ updated_at: 2024-06-15

Layer 2: Retrieval on query

When a user asks a question:

python

# 1. Embed the user's question
user_question = "How do I use Claude with streaming?"
query_vector = embed_model.embed(user_question)

# 2. Find similar documents in your database
similar = vector_db.search(
    query_vector,
    top_k=5,
    min_similarity=0.7
)

# 3. Result: Top-5 passages from your docs
retrieved = [
    {
        "text": "Claude supports streaming via server-sent events...",
        "source": "API Guide.pdf",
        "similarity": 0.94
    },
    ... (4 more)
]

Layer 3: Generation with Claude

python

# Construct the augmented prompt
prompt = f"""
Use the following context from Techcologic documentation:

{retrieved_context}

User question: {user_question}

Answer the question using ONLY the context above.
If the answer isn't in the context, say: "I don't have information on this."
Include citations: (Source: document_name)
"""

# Call Claude with your knowledge
response = claude.message(prompt, max_tokens=500)

Real example: internal knowledge base

Scenario: Techcologic's 50-page engineering handbook, constantly updated.

Without RAG:

Include the entire handbook in every prompt (150K tokens)
Cost: $2.25 per query (expensive!)
Fails when the handbook exceeds the context window

With RAG (the Techcologic approach):

Store handbook chunks in a vector database
Retrieve only relevant sections per query (2–5K tokens)
Cost: $0.03 per query (75× cheaper!)
Handbook can grow unlimited

Comparison:

Approach	Cost per query	Latency	Scalability	Updates
Naive (full context)	$2–5	5–10s	Limited to token window	Requires redeploy
RAG with pgvector	$0.02–0.05	1–2s	Unlimited docs	Instant
RAG + caching	$0.005–0.01	<500ms	Unlimited docs	Instant

Building RAG step by step

Step 1: Prepare documents

text

1. Collect your documents (PDFs, Markdown, text)
2. Extract text (PyPDF2, pdfplumber for PDFs)
3. Chunk into 500-1000 token pieces
4. Store in a database with metadata

Step 2: Set up the vector database

text

Option A: PostgreSQL + pgvector (self-hosted)
Option B: Pinecone (serverless)
Option C: Weaviate (open-source)

We recommend pgvector for most teams — it's cheap, reliable, debuggable.

Step 3: Embed & index

python

from anthropic import Anthropic

# Embed each document chunk
embeddings = model.embed(chunks)

# Store in the vector DB
vector_db.insert(chunks, embeddings, metadata)

Step 4: Build the retrieval function

python

def retrieve_context(question: str, top_k: int = 5):
    query_vector = embed_model.embed(question)
    results = vector_db.search(query_vector, top_k)
    return [r.text for r in results]

Step 5: Create the answer function

python

def answer_question(question: str):
    context = retrieve_context(question)
    prompt = f"""Context: {context}

    Question: {question}
    Answer:"""

    response = claude.message(prompt, max_tokens=500)
    return response

Common pitfalls (and how to avoid them)

Problem	Cause	Solution
Low-quality answers	Irrelevant documents retrieved	Improve chunking strategy, increase similarity threshold
High costs	Too many tokens sent to Claude	Optimize chunk size, retrieve fewer docs, use caching
Stale answers	Documents never updated	Set up automated sync, monitor freshness
Hallucination	Model invents info not in docs	Use a system prompt: "Only answer from provided context"

Techcologic's RAG stack

For production systems, we use:

text

Documents → Chunking (LangChain)
         → Embedding (text-embedding-3-small)
         → Storage (pgvector on RDS)
         → Retrieval (vector similarity search)
         → Generation (Claude API)
         → Monitoring (LangSmith, custom logging)

Result: production RAG systems that handle millions of queries, stay accurate, and cost under $0.02 per question.

Getting started today

If you're building with Claude and need to ground answers in your documents:

Start small — Pick 5–10 important docs
Chunk them — 500-token pieces
Embed them — Use a dedicated embedding model
Store them — PostgreSQL + pgvector (free tier available)
Test retrieval — Verify top-5 results make sense
Add Claude — Build the augmented prompt
Monitor — Track retrieval quality and token usage

This takes a weekend to prototype and a few days to production.

Ready to ship RAG? Book a Claude architecture call with Techcologic.

Key takeaways

RAG lets you augment Claude with your documents
Vector search finds relevant context in milliseconds
Costs drop 10–100× vs. naive approaches
Production RAG systems are reliable and maintainable

Written by The Techcologic Team.

The problem: how do you give Claude your company's knowledge?

The solution: retrieve relevant context, then generate

How Techcologic builds RAG systems

Layer 1: Embedding & vector search

Layer 2: Retrieval on query

Layer 3: Generation with Claude

Real example: internal knowledge base

Building RAG step by step

Step 1: Prepare documents

Step 2: Set up the vector database

Step 3: Embed & index

Step 4: Build the retrieval function

Step 5: Create the answer function

Common pitfalls (and how to avoid them)

Techcologic's RAG stack

Getting started today

Key takeaways

Building something with AI?