RAG in Production: Chunking Strategies, pgvector vs Pinecone, Retrieval Quality, and Evaluation
Retrieval-Augmented Generation (RAG) lets your LLM answer questions about your own data — documentation, knowledge bases, customer records — without fine-tuning. The core pipeline: chunk documents → embed chunks → store vectors → retrieve relevant chunks for each query → augment the LLM prompt with retrieved context.
The difference between a demo RAG and a production RAG is retrieval quality. Most RAG failures aren't LLM failures — they're retrieval failures: the wrong chunks are fetched, so the model either hallucinates or says "I don't know."
The RAG Pipeline
```
Documents → Chunking → Embedding → Vector Store
                                        ↓
User Query → Embedding → Vector Search → Relevant Chunks → LLM → Answer
```
Each step has quality tradeoffs.
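To make the flow concrete, here is a toy end-to-end sketch. Everything in it — the word-count `embed()`, the list-backed `store` — is a deliberately simplistic stand-in for the real components covered below:

```python
# Toy end-to-end RAG pipeline: texts are "embedded" as word-count vectors
# (a stand-in for a real embedding model), stored in a list, and
# retrieved by cosine similarity.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # real systems call an embedding API here

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

store: list[tuple[str, Counter]] = []  # the "vector store"

def index(chunks: list[str]) -> None:
    for chunk in chunks:
        store.append((chunk, embed(chunk)))

def retrieve(query: str, top_k: int = 2) -> list[str]:
    qv = embed(query)
    ranked = sorted(store, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

index([
    "The refund policy allows returns within 30 days.",
    "The API rate limit is 100 requests per minute.",
    "Support is available Monday through Friday.",
])
answer_context = retrieve("refund policy details", top_k=1)
```

The retrieved chunk would then be pasted into the LLM prompt, which is the "augment" step covered later.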
Chunking Strategies
The chunking approach determines retrieval precision more than any other factor.
| Strategy | How | Best For | Risk |
|---|---|---|---|
| Fixed-size | Split every N tokens, overlap M tokens | Simple, baseline | Splits mid-sentence, loses context |
| Sentence | Split on sentence boundaries | Prose, articles | Variable chunk size |
| Paragraph | Split on paragraph breaks | Documentation, books | Chunks too large/small |
| Recursive | Try paragraph → sentence → word until fits | General purpose | More complex |
| Semantic | Cluster by embedding similarity | Dense documents | Expensive, slow |
| Document-aware | Respect markdown headers, code blocks, tables | Structured docs | Format-specific |
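The fixed-size row in the table is just a sliding token window. A minimal sketch, with whitespace-split strings standing in for real tokenizer (e.g. tiktoken) tokens:

```python
# Fixed-size chunking with overlap over a pre-tokenized sequence.
# Each chunk shares `overlap` tokens with its predecessor so context
# isn't lost at chunk boundaries.
def fixed_size_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window reached the end of the document
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = fixed_size_chunks(tokens, size=4, overlap=1)
```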
```python
# scripts/chunker.py
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)
import tiktoken

enc = tiktoken.encoding_for_model('text-embedding-3-small')

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Recursive chunker (recommended for most content)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # tokens per chunk
    chunk_overlap=50,    # overlap to preserve context across chunks
    length_function=count_tokens,
    separators=["\n\n", "\n", ". ", " ", ""],  # try these in order
)

# Markdown-aware chunker (for documentation sites)
def chunk_markdown_doc(content: str, source_url: str) -> list[dict]:
    # First split by headers to preserve document structure
    header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ])
    sections = header_splitter.split_text(content)
    chunks = []
    for section in sections:
        # Then split large sections into token-sized chunks
        if count_tokens(section.page_content) > 512:
            sub_chunks = recursive_splitter.split_text(section.page_content)
            for i, chunk in enumerate(sub_chunks):
                chunks.append({
                    'text': chunk,
                    'metadata': {
                        **section.metadata,  # includes header hierarchy
                        'source': source_url,
                        'chunk_index': i,
                    },
                })
        else:
            chunks.append({
                'text': section.page_content,
                'metadata': {**section.metadata, 'source': source_url, 'chunk_index': 0},
            })
    return chunks
```
Chunk size guidance (tokens):
- 128–256: High precision retrieval, less context per chunk
- 512: Good default for most use cases
- 1024: Better context, worse precision for specific facts
- 2048: Usually too large; dilutes retrieval relevance
🤖 AI Is Not the Future — It Is Right Now
Businesses using AI automation cut manual work by 60–80%. We build production-ready AI systems — RAG pipelines, LLM integrations, custom ML models, and AI agent workflows.
- LLM integration (OpenAI, Anthropic, Gemini, local models)
- RAG systems that answer from your own data
- AI agents that take real actions — not just chat
- Custom ML models for prediction, classification, detection
pgvector vs Pinecone
| | pgvector | Pinecone | Qdrant |
|---|---|---|---|
| Type | PostgreSQL extension | Managed vector DB | OSS + managed |
| Setup | Add extension to existing PG | Managed SaaS | Docker or managed |
| Scale | Good up to ~5M vectors | Millions–billions | Millions+ |
| Hybrid search | BM25 via pg_bm25 (ParadeDB) | Built-in | Built-in |
| Cost (1M vectors) | ~$50/mo (RDS) | ~$70/mo | ~$60/mo (managed) |
| Best for | Existing PostgreSQL stack, < 5M vectors | Large scale, no infra management | Production OSS preference |
pgvector setup:
```sql
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Document chunks table
CREATE TABLE document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_url TEXT NOT NULL,
    content TEXT NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}',
    embedding vector(1536) NOT NULL,  -- OpenAI text-embedding-3-small dimension
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- HNSW index for fast approximate nearest neighbor search
-- (faster queries but slower inserts and index builds than IVFFlat)
CREATE INDEX idx_chunks_embedding ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Alternative: IVFFlat is also approximate (faster builds, lower recall).
-- For truly exact search on small collections (< 100K vectors), skip the
-- index entirely and let PostgreSQL do a sequential scan.
-- CREATE INDEX idx_chunks_embedding ON document_chunks
--     USING ivfflat (embedding vector_cosine_ops)
--     WITH (lists = 100);
```
```typescript
// lib/rag/store.ts
import { openai } from '../openai';
import { db } from '../db';

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}

// Store chunks with embeddings
export async function indexChunks(chunks: Array<{ text: string; metadata: object }>): Promise<void> {
  const embeddings = await Promise.all(chunks.map(c => embed(c.text)));
  // Insert via raw SQL (Prisma doesn't support the vector type natively)
  for (let i = 0; i < chunks.length; i++) {
    await db.$executeRaw`
      INSERT INTO document_chunks (content, metadata, embedding)
      VALUES (${chunks[i].text}, ${JSON.stringify(chunks[i].metadata)}::jsonb,
              ${JSON.stringify(embeddings[i])}::vector)
    `;
  }
}
```
```typescript
// Retrieve top-k relevant chunks for a query
import { Prisma } from '@prisma/client';

export async function retrieveChunks(
  query: string,
  topK: number = 5,
  filter?: { sourceUrl?: string },
): Promise<Array<{ content: string; metadata: object; similarity: number }>> {
  const queryEmbedding = await embed(query);
  // Compose the optional filter with Prisma.sql / Prisma.empty
  // (there is no db.$raw helper for conditional fragments)
  const sourceFilter = filter?.sourceUrl
    ? Prisma.sql`AND metadata->>'source' = ${filter.sourceUrl}`
    : Prisma.empty;
  const results = await db.$queryRaw<Array<{ content: string; metadata: object; similarity: number }>>`
    SELECT
      content,
      metadata,
      1 - (embedding <=> ${JSON.stringify(queryEmbedding)}::vector) AS similarity
    FROM document_chunks
    WHERE 1=1 ${sourceFilter}
    ORDER BY embedding <=> ${JSON.stringify(queryEmbedding)}::vector
    LIMIT ${topK}
  `;
  return results;
}
```
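The `similarity` expression above works because pgvector's `<=>` operator returns cosine *distance*, which is `1 - cosine similarity`. The relationship is easy to check in plain Python, independent of pgvector:

```python
# Cosine similarity and its complement, cosine distance: pgvector's <=>
# returns the distance, so SQL recovers similarity as 1 - (embedding <=> query).
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a: list[float], b: list[float]) -> float:
    return 1.0 - cosine_similarity(a, b)

v1 = [1.0, 0.0]   # identical to v2 → similarity 1, distance 0
v2 = [1.0, 0.0]
v3 = [0.0, 1.0]   # orthogonal to v1 → similarity 0, distance 1
```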
Hybrid Search: Dense + Sparse
Pure vector search misses exact keyword matches. Hybrid search combines vector (semantic) with BM25 (keyword) for better recall:
```python
# Hybrid search with RRF (Reciprocal Rank Fusion) reranking.
# Assumes vector_search(), bm25_search(), and get_chunk() are defined elsewhere.
from typing import List

def hybrid_search(query: str, top_k: int = 10) -> List[dict]:
    # Dense retrieval (vector similarity)
    dense_results = vector_search(query, top_k=top_k * 2)
    # Sparse retrieval (BM25 keyword)
    sparse_results = bm25_search(query, top_k=top_k * 2)

    # Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank)
    scores = {}
    k = 60  # RRF constant
    for rank, result in enumerate(dense_results):
        doc_id = result['id']
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, result in enumerate(sparse_results):
        doc_id = result['id']
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort by combined RRF score
    ranked_ids = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    # Fetch full documents for top results
    return [get_chunk(doc_id) for doc_id in ranked_ids[:top_k]]
```
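The sparse side of the fusion above is BM25. A minimal self-contained scorer — whitespace-split lowercase tokens stand in for a real analyzer, and the constants `k1` and `b` are the usual defaults:

```python
# Minimal BM25: ranks documents by keyword relevance, with term-frequency
# saturation (k1) and document-length normalization (b).
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            freq = tf[term]
            score += idf * (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * len(tokens) / avgdl))
        scores.append(score)
    return scores

docs = [
    "refund policy refund window is 30 days",
    "api rate limits and quotas",
    "contact support for refund questions",
]
scores = bm25_scores("refund policy", docs)
# doc 0 matches both query terms (and "refund" twice), so it ranks highest
```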
⚡ Your Competitors Are Already Using AI — Are You?
We build AI systems that actually work in production — not demos that die in a Colab notebook. From data pipeline to deployed model to real business outcomes.
- AI agent systems that run autonomously — not just chatbots
- Integrates with your existing tools (CRM, ERP, Slack, etc.)
- Explainable outputs — know why the model decided what it did
- Free AI opportunity audit for your business
The RAG Prompt Pattern
```typescript
// lib/rag/query.ts
import { openai } from '../openai';
import { retrieveChunks } from './store';

export async function ragQuery(userQuestion: string): Promise<string> {
  // 1. Retrieve relevant chunks, dropping low-relevance ones up front
  //    so the "no information" path also covers all-filtered results
  const chunks = (await retrieveChunks(userQuestion, 5))
    .filter(c => c.similarity > 0.7);
  if (chunks.length === 0) {
    return "I don't have information about that in the knowledge base.";
  }

  // 2. Build context from chunks
  const context = chunks
    .map((c, i) => `[Source ${i + 1}]: ${c.content}`)
    .join('\n\n');

  // 3. Augmented prompt
  const systemPrompt = `You are a helpful assistant. Answer questions using ONLY the provided context.
If the context doesn't contain enough information to answer, say "I don't have that information."
Do not make up information or use knowledge beyond the provided context.

Context:
${context}`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userQuestion },
    ],
    temperature: 0.2,  // low temperature for factual answers
    max_tokens: 500,
  });
  return response.choices[0].message.content ?? '';
}
```
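One detail the prompt pattern glosses over: retrieved chunks can overflow the model's context window. A sketch of budget-aware context assembly — whitespace token counts stand in for a real tokenizer like tiktoken:

```python
# Fit as many retrieved chunks as possible into a fixed token budget,
# keeping relevance order and dropping the overflow.
def build_context(chunks: list[str], budget: int) -> str:
    selected = []
    used = 0
    for chunk in chunks:
        n = len(chunk.split())  # stand-in token count; real code uses tiktoken
        if used + n > budget:
            break  # chunks are relevance-ordered, so stop at the first overflow
        selected.append(chunk)
        used += n
    return "\n\n".join(f"[Source {i + 1}]: {c}" for i, c in enumerate(selected))

chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
context = build_context(chunks, budget=5)
```

Stopping at the first overflow (rather than skipping ahead to smaller chunks) keeps the most relevant chunks and preserves their order in the prompt.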
RAG Evaluation with RAGAS
```python
# scripts/evaluate_rag.py
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Is the answer grounded in retrieved context?
    answer_relevancy,   # Is the answer relevant to the question?
    context_precision,  # Are retrieved chunks actually relevant?
    context_recall,     # Were all relevant chunks retrieved?
)
from datasets import Dataset

# Evaluation dataset (rag_query / retrieve_chunks are your pipeline's entry points)
test_cases = [
    {
        "question": "What is the refund policy?",
        "answer": rag_query("What is the refund policy?"),  # your RAG output
        "contexts": [c['content'] for c in retrieve_chunks("What is the refund policy?")],
        "ground_truth": "Refunds are available within 30 days of purchase...",  # known correct answer
    },
    # ... more test cases
]

dataset = Dataset.from_list(test_cases)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)

# Example output:
# faithfulness: 0.94       ← answers stick to context (< 0.9 = hallucination risk)
# answer_relevancy: 0.87   ← answers are relevant to questions
# context_precision: 0.78  ← retrieved chunks are relevant (< 0.7 = noisy retrieval)
# context_recall: 0.85     ← relevant information was retrieved
```
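Framework aside, the retrieval-side metrics are straightforward to compute yourself. A sketch of hit rate@k and Mean Reciprocal Rank over (retrieved ids, relevant ids) pairs:

```python
# Retrieval metrics without a framework: hit rate@k and MRR.
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    # Fraction of queries where at least one relevant doc appears in the top k
    hits = sum(1 for ids, rel in zip(retrieved, relevant) if set(ids[:k]) & rel)
    return hits / len(retrieved)

def mrr(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    # Mean Reciprocal Rank: 1/rank of the first relevant doc, averaged
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        for rank, doc_id in enumerate(ids, start=1):
            if doc_id in rel:
                total += 1 / rank
                break  # only the first relevant hit counts
    return total / len(retrieved)

retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"b"}, {"z"}, {"g"}]
```

Tracking these on a fixed query set catches retrieval regressions (e.g. after changing the chunker) before they surface as bad answers.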
Working With Viprasol
We build production RAG systems — document ingestion pipelines, chunking strategies, pgvector and Pinecone integration, hybrid search, RAGAS evaluation, and the LLM orchestration layer that ties it together.
→ Talk to our team about RAG and AI knowledge base development.
See Also
- LLM Integration — OpenAI SDK, streaming, and function calling
- Vector Database and RAG — vector database fundamentals
- Machine Learning Model Deployment — serving AI models
- PostgreSQL Performance — optimizing pgvector queries
- AI/ML Services — AI product development
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.