
RAG in Production: Chunking Strategies, pgvector vs Pinecone, Retrieval Quality, and Evaluation

Build production-grade RAG systems in 2026 — document chunking strategies, embedding models, pgvector vs Pinecone comparison, hybrid search (BM25 + vector), and RAG evaluation with RAGAS.

Viprasol Tech Team
June 29, 2026
13 min read


Retrieval-Augmented Generation (RAG) lets your LLM answer questions about your own data — documentation, knowledge bases, customer records — without fine-tuning. The core pipeline: chunk documents → embed chunks → store vectors → retrieve relevant chunks for each query → augment the LLM prompt with retrieved context.

The difference between a demo RAG and a production RAG is retrieval quality. Most RAG failures aren't LLM failures — they're retrieval failures: the wrong chunks are fetched, so the model either hallucinates or says "I don't know."


The RAG Pipeline

Documents → Chunking → Embedding → Vector Store
                                        ↓
User Query → Embedding → Vector Search → Relevant Chunks → LLM → Answer

Each step has quality tradeoffs.
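
Under the hood, the retrieval step is nearest-neighbor search over embedding vectors, usually scored by cosine similarity. A minimal sketch of that core operation, with tiny hand-made 3-d vectors standing in for real embedding-model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Indices of the k chunks most similar to the query (linear scan)."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy vectors standing in for real 1536-d embeddings
chunk_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query_vector = [1.0, 0.05, 0.0]
print(top_k(query_vector, chunk_vectors))  # → [0, 1]
```

Production stores replace the linear scan with approximate indexes (HNSW, IVFFlat), but the scoring function is the same.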


Chunking Strategies

The chunking approach determines retrieval precision more than any other factor.

| Strategy       | How                                           | Best For             | Risk                               |
|----------------|-----------------------------------------------|----------------------|------------------------------------|
| Fixed-size     | Split every N tokens, overlap M tokens        | Simple, baseline     | Splits mid-sentence, loses context |
| Sentence       | Split on sentence boundaries                  | Prose, articles      | Variable chunk size                |
| Paragraph      | Split on paragraph breaks                     | Documentation, books | Chunks too large/small             |
| Recursive      | Try paragraph → sentence → word until fits    | General purpose      | More complex                       |
| Semantic       | Cluster by embedding similarity               | Dense documents      | Expensive, slow                    |
| Document-aware | Respect markdown headers, code blocks, tables | Structured docs      | Format-specific                    |

# scripts/chunker.py
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
import tiktoken

enc = tiktoken.encoding_for_model('text-embedding-3-small')

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Recursive chunker (recommended for most content)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,          # tokens per chunk
    chunk_overlap=50,        # overlap to preserve context across chunks
    length_function=count_tokens,
    separators=["\n\n", "\n", ". ", " ", ""],  # Try these in order
)

# Markdown-aware chunker (for documentation sites)
def chunk_markdown_doc(content: str, source_url: str) -> list[dict]:
    # First split by headers to preserve document structure
    header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ])
    sections = header_splitter.split_text(content)

    chunks = []
    for section in sections:
        # Then split large sections into token-sized chunks
        if count_tokens(section.page_content) > 512:
            sub_chunks = recursive_splitter.split_text(section.page_content)
            for i, chunk in enumerate(sub_chunks):
                chunks.append({
                    'text': chunk,
                    'metadata': {
                        **section.metadata,  # Includes header hierarchy
                        'source': source_url,
                        'chunk_index': i,
                    },
                })
        else:
            chunks.append({
                'text': section.page_content,
                'metadata': { **section.metadata, 'source': source_url, 'chunk_index': 0 },
            })

    return chunks

Chunk size guidance (tokens):

  • 128–256: High precision retrieval, less context per chunk
  • 512: Good default for most use cases
  • 1024: Better context, worse precision for specific facts
  • 2048: Usually too large; dilutes retrieval relevance
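
For reference, the fixed-size baseline from the strategy table is only a few lines. A sketch that splits on whitespace tokens as a stand-in for real tokenizer tokens (in production, count with tiktoken as in the chunker above):

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into chunks of chunk_size tokens; consecutive chunks
    share `overlap` tokens so context survives the boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()  # whitespace tokens; swap in a real tokenizer for token counts
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(1000))
parts = fixed_size_chunks(doc)
print(len(parts))  # → 3
```

The overlap is what the "loses context" risk in the table is hedging against: a fact split across a boundary still appears whole in one of the two adjacent chunks.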



pgvector vs Pinecone

|                   | pgvector                                | Pinecone                         | Qdrant                    |
|-------------------|-----------------------------------------|----------------------------------|---------------------------|
| Type              | PostgreSQL extension                    | Managed vector DB                | OSS + managed             |
| Setup             | Add extension to existing PG            | Managed SaaS                     | Docker or managed         |
| Scale             | Up to ~5M vectors                       | Millions–billions                | Millions+                 |
| Hybrid search     | BM25 via ParadeDB pg_search             | Built-in                         | Built-in                  |
| Cost (1M vectors) | ~$50/mo (RDS)                           | ~$70/mo                          | ~$60/mo (managed)         |
| Best for          | Existing PostgreSQL stack, < 5M vectors | Large scale, no infra management | Production OSS preference |

pgvector setup:

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Document chunks table
CREATE TABLE document_chunks (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  source_url  TEXT NOT NULL,
  content     TEXT NOT NULL,
  metadata    JSONB NOT NULL DEFAULT '{}',
  embedding   vector(1536) NOT NULL,  -- OpenAI text-embedding-3-small dimension
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- HNSW index for fast approximate nearest neighbor search
-- Faster queries but slower inserts than IVFFlat
CREATE INDEX idx_chunks_embedding ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Alternative: IVFFlat (faster to build, lower recall; create it only
-- after loading data so the cluster centroids are representative):
-- CREATE INDEX idx_chunks_embedding ON document_chunks
--   USING ivfflat (embedding vector_cosine_ops)
--   WITH (lists = 100);
-- For small collections (< 100K rows), exact search with no index at all is fine.

Indexing and retrieval from the application layer:

// lib/rag/store.ts
import { Prisma } from '@prisma/client';
import { openai } from '../openai';
import { db } from '../db';

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}

// Store chunks with embeddings
export async function indexChunks(chunks: Array<{ text: string; metadata: object }>): Promise<void> {
  // The embeddings endpoint accepts an array input: one request covers
  // all chunks, and results come back in input order
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks.map(c => c.text),
  });
  const embeddings = response.data.map(d => d.embedding);

  // Batch insert via raw SQL (Prisma doesn't support vector type natively)
  for (let i = 0; i < chunks.length; i++) {
    await db.$executeRaw`
      INSERT INTO document_chunks (content, metadata, embedding)
      VALUES (${chunks[i].text}, ${JSON.stringify(chunks[i].metadata)}::jsonb,
              ${JSON.stringify(embeddings[i])}::vector)
    `;
  }
}

// Retrieve top-k relevant chunks for a query
export async function retrieveChunks(
  query: string,
  topK: number = 5,
  filter?: { sourceUrl?: string },
): Promise<Array<{ content: string; metadata: object; similarity: number }>> {
  const queryEmbedding = await embed(query);

  const results = await db.$queryRaw<Array<{ content: string; metadata: object; similarity: number }>>`
    SELECT
      content,
      metadata,
      1 - (embedding <=> ${JSON.stringify(queryEmbedding)}::vector) AS similarity
    FROM document_chunks
    WHERE 1=1
      ${filter?.sourceUrl
        ? Prisma.sql`AND metadata->>'source' = ${filter.sourceUrl}`
        : Prisma.empty}
    ORDER BY embedding <=> ${JSON.stringify(queryEmbedding)}::vector
    LIMIT ${topK}
  `;

  return results;
}

Hybrid Search: Dense + Sparse

Pure vector search misses exact keyword matches. Hybrid search combines vector (semantic) with BM25 (keyword) for better recall:

# Hybrid search with RRF (Reciprocal Rank Fusion) reranking
from typing import List

def hybrid_search(query: str, top_k: int = 10) -> List[dict]:
    # Dense retrieval (vector similarity); vector_search, bm25_search, and
    # get_chunk are assumed wrappers around your vector store and keyword index
    dense_results = vector_search(query, top_k=top_k * 2)

    # Sparse retrieval (BM25 keyword)
    sparse_results = bm25_search(query, top_k=top_k * 2)

    # Reciprocal Rank Fusion
    scores = {}
    k = 60  # RRF constant

    for rank, result in enumerate(dense_results):
        doc_id = result['id']
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    for rank, result in enumerate(sparse_results):
        doc_id = result['id']
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort by combined RRF score
    ranked_ids = sorted(scores.keys(), key=lambda id: scores[id], reverse=True)

    # Fetch full documents for top results
    return [get_chunk(id) for id in ranked_ids[:top_k]]
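
The bm25_search side above is assumed; in production it usually comes from ParadeDB pg_search, Elasticsearch, or a library such as rank_bm25. For intuition, here is a self-contained sketch of Okapi BM25 scoring over an in-memory corpus (the corpus and naive whitespace tokenization are illustrative):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()  # document frequency per term
    for d in tokenized:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores

docs = [
    "refunds are available within 30 days of purchase",
    "shipping takes 5 business days",
    "contact support for refund questions",
]
scores = bm25_scores("refund policy", docs)
print(max(range(len(docs)), key=lambda i: scores[i]))  # → 2
```

Note that "refund" matches only the document with that exact token; without stemming, "refunds" scores zero. That brittleness is exactly the gap dense retrieval covers, which is why the two are fused rather than used alone.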


The RAG Prompt Pattern

// lib/rag/query.ts
export async function ragQuery(userQuestion: string): Promise<string> {
  // 1. Retrieve relevant chunks and drop low-similarity matches up front,
  //    so an all-filtered result still triggers the fallback below
  const chunks = (await retrieveChunks(userQuestion, 5))
    .filter(c => c.similarity > 0.7);

  if (chunks.length === 0) {
    return "I don't have information about that in the knowledge base.";
  }

  // 2. Build context from the retained chunks
  const context = chunks
    .map((c, i) => `[Source ${i + 1}]: ${c.content}`)
    .join('\n\n');

  // 3. Augmented prompt
  const systemPrompt = `You are a helpful assistant. Answer questions using ONLY the provided context.
If the context doesn't contain enough information to answer, say "I don't have that information."
Do not make up information or use knowledge beyond the provided context.

Context:
${context}`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userQuestion },
    ],
    temperature: 0.2,  // Low temperature for factual answers
    max_tokens: 500,
  });

  return response.choices[0].message.content ?? '';
}

RAG Evaluation with RAGAS

# scripts/evaluate_rag.py
from ragas import evaluate
from ragas.metrics import (
    faithfulness,            # Is the answer grounded in retrieved context?
    answer_relevancy,        # Is the answer relevant to the question?
    context_precision,       # Are retrieved chunks actually relevant?
    context_recall,          # Were all relevant chunks retrieved?
)
from datasets import Dataset

# Evaluation dataset
test_cases = [
    {
        "question": "What is the refund policy?",
        "answer": rag_query("What is the refund policy?"),  # Your RAG output
        "contexts": [c['content'] for c in retrieve_chunks("What is the refund policy?")],
        "ground_truth": "Refunds are available within 30 days of purchase...",  # Known correct answer
    },
    # ... more test cases
]

dataset = Dataset.from_list(test_cases)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])

print(results)
# faithfulness:      0.94  ← answers stick to context (< 0.9 = hallucination risk)
# answer_relevancy:  0.87  ← answers are relevant to questions
# context_precision: 0.78  ← retrieved chunks are relevant (< 0.7 = noisy retrieval)
# context_recall:    0.85  ← relevant information was retrieved
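
RAGAS metrics are LLM-judged, so they cost tokens and vary slightly run to run. For fast, deterministic regression checks on the retrieval step alone, plain hit rate and MRR over a small labeled query set are a useful complement (the eval_set format here is illustrative):

```python
def retrieval_metrics(results: list[tuple[list[str], str]]) -> dict:
    """results: (ranked_chunk_ids, expected_chunk_id) per test query.
    Hit rate: fraction of queries where the expected chunk was retrieved at all.
    MRR: mean of 1/rank of the expected chunk (0 when it was missed)."""
    hits = 0
    rr_sum = 0.0
    for ranked_ids, expected in results:
        if expected in ranked_ids:
            hits += 1
            rr_sum += 1 / (ranked_ids.index(expected) + 1)
    n = len(results)
    return {"hit_rate": hits / n, "mrr": rr_sum / n}

# Three labeled queries scored against top-3 retrieval results
eval_set = [
    (["c1", "c2", "c3"], "c1"),  # expected chunk ranked first
    (["c4", "c5", "c6"], "c5"),  # ranked second
    (["c7", "c8", "c9"], "c0"),  # expected chunk missed entirely
]
print(retrieval_metrics(eval_set))  # hit_rate ≈ 0.67, mrr = 0.5
```

Run this on every index rebuild or chunking change; a drop in hit rate tells you retrieval regressed before any LLM-judged metric does.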

Working With Viprasol

We build production RAG systems — document ingestion pipelines, chunking strategies, pgvector and Pinecone integration, hybrid search, RAGAS evaluation, and the LLM orchestration layer that ties it together.

Talk to our team about RAG and AI knowledge base development.



About the Author


Viprasol Tech Team

Custom Software Development Specialists

The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.

MT4/MT5 EA Development · AI Agent Systems · SaaS Development · Algorithmic Trading
