RAG in Production: Chunking Strategies, pgvector vs Pinecone, Retrieval Quality, and Evaluation
Retrieval-Augmented Generation (RAG) lets your LLM answer questions about your own data — documentation, knowledge bases, customer records — without fine-tuning. The core pipeline: chunk documents → embed chunks → store vectors → retrieve relevant chunks for each query → augment the LLM prompt with retrieved context.
The difference between a demo RAG and a production RAG is retrieval quality. Most RAG failures aren't LLM failures — they're retrieval failures: the wrong chunks are fetched, so the model either hallucinates or says "I don't know."
The RAG Pipeline
```
Documents → Chunking → Embedding → Vector Store
                                        ↓
User Query → Embedding → Vector Search → Relevant Chunks → LLM → Answer
```
Each step has quality tradeoffs.
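To make the flow concrete, here is a toy end-to-end sketch. Everything in it — the word-count `embed()`, the list-backed `store` — is a deliberately simplistic stand-in for the real components covered below:

```python
# Toy end-to-end RAG pipeline: texts are "embedded" as word-count vectors
# (a stand-in for a real embedding model), stored in a list, and
# retrieved by cosine similarity.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # real systems call an embedding API here

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

store: list[tuple[str, Counter]] = []  # the "vector store"

def index(chunks: list[str]) -> None:
    for chunk in chunks:
        store.append((chunk, embed(chunk)))

def retrieve(query: str, top_k: int = 2) -> list[str]:
    qv = embed(query)
    ranked = sorted(store, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

index([
    "The refund policy allows returns within 30 days.",
    "The API rate limit is 100 requests per minute.",
    "Support is available Monday through Friday.",
])
answer_context = retrieve("refund policy details", top_k=1)
```

The retrieved chunk would then be pasted into the LLM prompt, which is the "augment" step covered later.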
Chunking Strategies
The chunking approach determines retrieval precision more than any other factor.
| Strategy | How | Best For | Risk |
|---|---|---|---|
| Fixed-size | Split every N tokens, overlap M tokens | Simple, baseline | Splits mid-sentence, loses context |
| Sentence | Split on sentence boundaries | Prose, articles | Variable chunk size |
| Paragraph | Split on paragraph breaks | Documentation, books | Chunks too large/small |
| Recursive | Try paragraph → sentence → word until fits | General purpose | More complex |
| Semantic | Cluster by embedding similarity | Dense documents | Expensive, slow |
| Document-aware | Respect markdown headers, code blocks, tables | Structured docs | Format-specific |
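The fixed-size row in the table is just a sliding token window. A minimal sketch, with whitespace-split strings standing in for real tokenizer (e.g. tiktoken) tokens:

```python
# Fixed-size chunking with overlap over a pre-tokenized sequence.
# Each chunk shares `overlap` tokens with its predecessor so context
# isn't lost at chunk boundaries.
def fixed_size_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window reached the end of the document
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = fixed_size_chunks(tokens, size=4, overlap=1)
```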
```python
# scripts/chunker.py
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)
import tiktoken

enc = tiktoken.encoding_for_model('text-embedding-3-small')

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Recursive chunker (recommended for most content)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # tokens per chunk
    chunk_overlap=50,    # overlap to preserve context across chunks
    length_function=count_tokens,
    separators=["\n\n", "\n", ". ", " ", ""],  # try these in order
)

# Markdown-aware chunker (for documentation sites)
def chunk_markdown_doc(content: str, source_url: str) -> list[dict]:
    # First split by headers to preserve document structure
    header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ])
    sections = header_splitter.split_text(content)
    chunks = []
    for section in sections:
        # Then split large sections into token-sized chunks
        if count_tokens(section.page_content) > 512:
            sub_chunks = recursive_splitter.split_text(section.page_content)
            for i, chunk in enumerate(sub_chunks):
                chunks.append({
                    'text': chunk,
                    'metadata': {
                        **section.metadata,  # includes header hierarchy
                        'source': source_url,
                        'chunk_index': i,
                    },
                })
        else:
            chunks.append({
                'text': section.page_content,
                'metadata': {**section.metadata, 'source': source_url, 'chunk_index': 0},
            })
    return chunks
```
Chunk size guidance (tokens):
- 128–256: High precision retrieval, less context per chunk
- 512: Good default for most use cases
- 1024: Better context, worse precision for specific facts
- 2048: Usually too large; dilutes retrieval relevance
🤖 AI Is Not the Future — It Is Right Now
Businesses using AI automation cut manual work by 60–80%. We build production-ready AI systems — RAG pipelines, LLM integrations, custom ML models, and AI agent workflows.
- LLM integration (OpenAI, Anthropic, Gemini, local models)
- RAG systems that answer from your own data
- AI agents that take real actions — not just chat
- Custom ML models for prediction, classification, detection
pgvector vs Pinecone
| | pgvector | Pinecone | Qdrant |
|---|---|---|---|
| Type | PostgreSQL extension | Managed vector DB | OSS + managed |
| Setup | Add extension to existing PG | Managed SaaS | Docker or managed |
| Scale | Good up to ~5M vectors | Millions–billions | Millions+ |
| Hybrid search | BM25 via pg_bm25 (ParadeDB) | Built-in | Built-in |
| Cost (1M vectors) | ~$50/mo (RDS) | ~$70/mo | ~$60/mo (managed) |
| Best for | Existing PostgreSQL stack, < 5M vectors | Large scale, no infra management | Production OSS preference |
pgvector setup:
```sql
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Document chunks table
CREATE TABLE document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_url TEXT NOT NULL,
    content TEXT NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}',
    embedding vector(1536) NOT NULL,  -- OpenAI text-embedding-3-small dimension
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- HNSW index for fast approximate nearest neighbor search
-- (faster queries but slower inserts and index builds than IVFFlat)
CREATE INDEX idx_chunks_embedding ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Alternative: IVFFlat is also approximate (faster builds, lower recall).
-- For truly exact search on small collections (< 100K vectors), skip the
-- index entirely and let PostgreSQL do a sequential scan.
-- CREATE INDEX idx_chunks_embedding ON document_chunks
--     USING ivfflat (embedding vector_cosine_ops)
--     WITH (lists = 100);
```
```typescript
// lib/rag/store.ts
import { openai } from '../openai';
import { db } from '../db';

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}

// Store chunks with embeddings
export async function indexChunks(chunks: Array<{ text: string; metadata: object }>): Promise<void> {
  const embeddings = await Promise.all(chunks.map(c => embed(c.text)));
  // Insert via raw SQL (Prisma doesn't support the vector type natively)
  for (let i = 0; i < chunks.length; i++) {
    await db.$executeRaw`
      INSERT INTO document_chunks (content, metadata, embedding)
      VALUES (${chunks[i].text}, ${JSON.stringify(chunks[i].metadata)}::jsonb,
              ${JSON.stringify(embeddings[i])}::vector)
    `;
  }
}
```
```typescript
// Retrieve top-k relevant chunks for a query
import { Prisma } from '@prisma/client';

export async function retrieveChunks(
  query: string,
  topK: number = 5,
  filter?: { sourceUrl?: string },
): Promise<Array<{ content: string; metadata: object; similarity: number }>> {
  const queryEmbedding = await embed(query);
  // Compose the optional filter with Prisma.sql / Prisma.empty
  // (there is no db.$raw helper for conditional fragments)
  const sourceFilter = filter?.sourceUrl
    ? Prisma.sql`AND metadata->>'source' = ${filter.sourceUrl}`
    : Prisma.empty;
  const results = await db.$queryRaw<Array<{ content: string; metadata: object; similarity: number }>>`
    SELECT
      content,
      metadata,
      1 - (embedding <=> ${JSON.stringify(queryEmbedding)}::vector) AS similarity
    FROM document_chunks
    WHERE 1=1 ${sourceFilter}
    ORDER BY embedding <=> ${JSON.stringify(queryEmbedding)}::vector
    LIMIT ${topK}
  `;
  return results;
}
```
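The `similarity` expression above works because pgvector's `<=>` operator returns cosine *distance*, which is `1 - cosine similarity`. The relationship is easy to check in plain Python, independent of pgvector:

```python
# Cosine similarity and its complement, cosine distance: pgvector's <=>
# returns the distance, so SQL recovers similarity as 1 - (embedding <=> query).
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a: list[float], b: list[float]) -> float:
    return 1.0 - cosine_similarity(a, b)

v1 = [1.0, 0.0]   # identical to v2 → similarity 1, distance 0
v2 = [1.0, 0.0]
v3 = [0.0, 1.0]   # orthogonal to v1 → similarity 0, distance 1
```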
Hybrid Search: Dense + Sparse
Pure vector search misses exact keyword matches. Hybrid search combines vector (semantic) with BM25 (keyword) for better recall:
```python
# Hybrid search with RRF (Reciprocal Rank Fusion) reranking.
# Assumes vector_search(), bm25_search(), and get_chunk() are defined elsewhere.
from typing import List

def hybrid_search(query: str, top_k: int = 10) -> List[dict]:
    # Dense retrieval (vector similarity)
    dense_results = vector_search(query, top_k=top_k * 2)
    # Sparse retrieval (BM25 keyword)
    sparse_results = bm25_search(query, top_k=top_k * 2)

    # Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank)
    scores = {}
    k = 60  # RRF constant
    for rank, result in enumerate(dense_results):
        doc_id = result['id']
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, result in enumerate(sparse_results):
        doc_id = result['id']
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort by combined RRF score
    ranked_ids = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    # Fetch full documents for top results
    return [get_chunk(doc_id) for doc_id in ranked_ids[:top_k]]
```
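The sparse side of the fusion above is BM25. A minimal self-contained scorer — whitespace-split lowercase tokens stand in for a real analyzer, and the constants `k1` and `b` are the usual defaults:

```python
# Minimal BM25: ranks documents by keyword relevance, with term-frequency
# saturation (k1) and document-length normalization (b).
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            freq = tf[term]
            score += idf * (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * len(tokens) / avgdl))
        scores.append(score)
    return scores

docs = [
    "refund policy refund window is 30 days",
    "api rate limits and quotas",
    "contact support for refund questions",
]
scores = bm25_scores("refund policy", docs)
# doc 0 matches both query terms (and "refund" twice), so it ranks highest
```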
⚡ Your Competitors Are Already Using AI — Are You?
We build AI systems that actually work in production — not demos that die in a Colab notebook. From data pipeline to deployed model to real business outcomes.
- AI agent systems that run autonomously — not just chatbots
- Integrates with your existing tools (CRM, ERP, Slack, etc.)
- Explainable outputs — know why the model decided what it did
- Free AI opportunity audit for your business
The RAG Prompt Pattern
```typescript
// lib/rag/query.ts
import { openai } from '../openai';
import { retrieveChunks } from './store';

export async function ragQuery(userQuestion: string): Promise<string> {
  // 1. Retrieve relevant chunks, dropping low-relevance ones up front
  //    so the "no information" path also covers all-filtered results
  const chunks = (await retrieveChunks(userQuestion, 5))
    .filter(c => c.similarity > 0.7);
  if (chunks.length === 0) {
    return "I don't have information about that in the knowledge base.";
  }

  // 2. Build context from chunks
  const context = chunks
    .map((c, i) => `[Source ${i + 1}]: ${c.content}`)
    .join('\n\n');

  // 3. Augmented prompt
  const systemPrompt = `You are a helpful assistant. Answer questions using ONLY the provided context.
If the context doesn't contain enough information to answer, say "I don't have that information."
Do not make up information or use knowledge beyond the provided context.

Context:
${context}`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userQuestion },
    ],
    temperature: 0.2,  // low temperature for factual answers
    max_tokens: 500,
  });
  return response.choices[0].message.content ?? '';
}
```
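One detail the prompt pattern glosses over: retrieved chunks can overflow the model's context window. A sketch of budget-aware context assembly — whitespace token counts stand in for a real tokenizer like tiktoken:

```python
# Fit as many retrieved chunks as possible into a fixed token budget,
# keeping relevance order and dropping the overflow.
def build_context(chunks: list[str], budget: int) -> str:
    selected = []
    used = 0
    for chunk in chunks:
        n = len(chunk.split())  # stand-in token count; real code uses tiktoken
        if used + n > budget:
            break  # chunks are relevance-ordered, so stop at the first overflow
        selected.append(chunk)
        used += n
    return "\n\n".join(f"[Source {i + 1}]: {c}" for i, c in enumerate(selected))

chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
context = build_context(chunks, budget=5)
```

Stopping at the first overflow (rather than skipping ahead to smaller chunks) keeps the most relevant chunks and preserves their order in the prompt.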
RAG Evaluation with RAGAS
```python
# scripts/evaluate_rag.py
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Is the answer grounded in retrieved context?
    answer_relevancy,   # Is the answer relevant to the question?
    context_precision,  # Are retrieved chunks actually relevant?
    context_recall,     # Were all relevant chunks retrieved?
)
from datasets import Dataset

# Evaluation dataset (rag_query / retrieve_chunks are your pipeline's entry points)
test_cases = [
    {
        "question": "What is the refund policy?",
        "answer": rag_query("What is the refund policy?"),  # your RAG output
        "contexts": [c['content'] for c in retrieve_chunks("What is the refund policy?")],
        "ground_truth": "Refunds are available within 30 days of purchase...",  # known correct answer
    },
    # ... more test cases
]

dataset = Dataset.from_list(test_cases)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)

# Example output:
# faithfulness: 0.94       ← answers stick to context (< 0.9 = hallucination risk)
# answer_relevancy: 0.87   ← answers are relevant to questions
# context_precision: 0.78  ← retrieved chunks are relevant (< 0.7 = noisy retrieval)
# context_recall: 0.85     ← relevant information was retrieved
```
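Framework aside, the retrieval-side metrics are straightforward to compute yourself. A sketch of hit rate@k and Mean Reciprocal Rank over (retrieved ids, relevant ids) pairs:

```python
# Retrieval metrics without a framework: hit rate@k and MRR.
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    # Fraction of queries where at least one relevant doc appears in the top k
    hits = sum(1 for ids, rel in zip(retrieved, relevant) if set(ids[:k]) & rel)
    return hits / len(retrieved)

def mrr(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    # Mean Reciprocal Rank: 1/rank of the first relevant doc, averaged
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        for rank, doc_id in enumerate(ids, start=1):
            if doc_id in rel:
                total += 1 / rank
                break  # only the first relevant hit counts
    return total / len(retrieved)

retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"b"}, {"z"}, {"g"}]
```

Tracking these on a fixed query set catches retrieval regressions (e.g. after changing the chunker) before they surface as bad answers.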
Working With Viprasol
We build production RAG systems — document ingestion pipelines, chunking strategies, pgvector and Pinecone integration, hybrid search, RAGAS evaluation, and the LLM orchestration layer that ties it together.
→ Talk to our team about RAG and AI knowledge base development.
See Also
- LLM Integration — OpenAI SDK, streaming, and function calling
- Vector Database and RAG — vector database fundamentals
- Machine Learning Model Deployment — serving AI models
- PostgreSQL Performance — optimizing pgvector queries
- AI/ML Services — AI product development
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.