PostgreSQL Statistics and ANALYZE: pg_stats, Autovacuum Tuning, and Query Planner Hints

Understand and tune PostgreSQL query planner statistics. Covers pg_stats and pg_statistic, ANALYZE frequency, autovacuum tuning for high-write tables, default_statistics_target, extended statistics for correlated columns, and planner configuration parameters.

Viprasol Tech Team
June 4, 2027
11 min read

The PostgreSQL query planner makes decisions based on statistics: estimates of how many rows a condition will match, how values are distributed across a column, and whether columns are correlated. Stale or inaccurate statistics cause the planner to choose the wrong plan: a sequential scan instead of an index scan, or a nested loop instead of a hash join. Understanding how statistics work lets you fix bad plans without rewriting queries.

How PostgreSQL Uses Statistics

-- The planner reads from pg_stats (a view over pg_statistic)
SELECT
  tablename,
  attname              AS column,
  n_distinct,          -- Estimated distinct values (-1 = all distinct; -0.5 = 50% distinct)
  null_frac,           -- Fraction of NULLs
  avg_width,           -- Average column width in bytes
  most_common_vals,    -- Most frequent values (MCV list)
  most_common_freqs,   -- Frequency of each MCV
  histogram_bounds     -- Bucket boundaries for range queries
FROM pg_stats
WHERE tablename = 'events'
ORDER BY attname;

-- n_distinct interpretation:
-- n_distinct > 0: absolute estimate (e.g., 500 = exactly 500 distinct values)
-- n_distinct < 0: fraction of rows (e.g., -1 = all rows distinct, -0.5 = 50% distinct)
-- n_distinct = 0: number of distinct values unknown (not yet computed)
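The sign convention above is easy to get wrong in tooling. A small helper (hypothetical, not part of any library) that turns a pg_stats n_distinct value into an estimated distinct count for a given table size:

```typescript
// Interpret pg_stats.n_distinct for a table with `rowCount` live rows.
// Positive: absolute count of distinct values.
// Negative: fraction of rows that are distinct (scales with table growth).
// Zero: unknown / not yet computed.
function estimateDistinct(nDistinct: number, rowCount: number): number | null {
  if (nDistinct === 0) return null;
  if (nDistinct > 0) return Math.round(nDistinct);
  return Math.round(-nDistinct * rowCount);
}
```

Note that a negative n_distinct is the planner's way of saying "this ratio will hold as the table grows", which is why unique columns show -1 rather than a fixed count.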

-- Check if stats are fresh
SELECT
  schemaname,
  relname,
  n_live_tup,
  n_dead_tup,
  last_analyze,
  last_autoanalyze,
  -- How stale are the stats? (rows changed since last analyze)
  n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname IN ('events', 'users', 'projects')
ORDER BY n_mod_since_analyze DESC;

Running ANALYZE

-- Analyze a single table (fast — doesn't block reads/writes)
ANALYZE events;

-- Analyze specific columns only (even faster)
ANALYZE events (workspace_id, name, occurred_at);

-- Analyze entire database (run after bulk imports)
ANALYZE VERBOSE;

-- Check how long ANALYZE takes (important for autovacuum tuning)
\timing
ANALYZE events;
-- Timing: 1200 ms (example)

-- If ANALYZE takes > 30 seconds: table is very large or statistics_target is high
-- Consider per-column statistics_target instead of global

-- Force fresh statistics before an important query:
BEGIN;
ANALYZE events;
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
ROLLBACK;  -- Don't commit — just checking the plan


statistics_target: Control Sample Size

-- Default is 100 (enough for most tables)
-- Higher = better estimates for skewed distributions, but slower ANALYZE
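The target also drives sample size: ANALYZE reads roughly 300 × statistics_target rows per table (the heuristic in PostgreSQL's std_typanalyze), which is why high targets slow ANALYZE down. A quick sketch:

```typescript
// Approximate number of rows ANALYZE samples for a given statistics target.
// PostgreSQL uses 300 * target (std_typanalyze in src/backend/commands/analyze.c);
// the table-level sample uses the highest target among its columns.
function analyzeSampleRows(statisticsTarget: number): number {
  return 300 * statisticsTarget;
}
```

At the default target of 100 that is about 30,000 sampled rows; raising one column to 500 pushes the sample to about 150,000 rows.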

-- Check current global setting
SHOW default_statistics_target;  -- 100

-- Increase for tables with highly skewed distributions
-- (e.g., events.name has 10 common values but 10,000 total)
ALTER TABLE events
  ALTER COLUMN name SET STATISTICS 500;  -- Per-column override

ALTER TABLE events
  ALTER COLUMN workspace_id SET STATISTICS 200;

-- Very high-cardinality UUID columns: statistics don't help much
-- Keep at 100 or lower
ALTER TABLE events
  ALTER COLUMN id SET STATISTICS 50;

-- After changing statistics_target: must re-ANALYZE
ANALYZE events;

-- Verify the change
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'events'::regclass
  AND attnum > 0 AND NOT attisdropped  -- skip system and dropped columns
  AND attstattarget <> -1              -- -1 (NULL in PostgreSQL 17+) = inherit the default
ORDER BY attname;

Autovacuum Tuning for High-Write Tables

Autovacuum triggers ANALYZE automatically, but default thresholds are tuned for medium-traffic tables. High-write tables (like an events table with millions of inserts per day) need more aggressive settings:

-- Default autovacuum ANALYZE trigger:
-- autovacuum_analyze_threshold = 50 (rows)
-- autovacuum_analyze_scale_factor = 0.2 (20% of table)
-- Total: 50 + (0.2 × n_live_tup) changed rows

-- For a 10M-row events table: threshold = 50 + (0.2 × 10,000,000) = 2,000,050 rows
-- This means stats could be VERY stale on a busy table

-- Override autovacuum settings per-table (no restart required)
ALTER TABLE events SET (
  autovacuum_analyze_scale_factor = 0.01,   -- 1% instead of 20%
  autovacuum_analyze_threshold    = 1000,   -- 1,000 rows instead of 50
  autovacuum_vacuum_scale_factor  = 0.01,
  autovacuum_vacuum_threshold     = 1000
);

-- For a 10M-row table: threshold = 1,000 + (0.01 × 10,000,000) = 101,000 rows
-- Much more responsive to traffic patterns
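The trigger arithmetic above can be captured in a small helper (hypothetical; the defaults match stock postgresql.conf):

```typescript
// Rows that must change before autovacuum schedules an automatic ANALYZE:
// autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * n_live_tup
function autoAnalyzeTrigger(
  liveTuples: number,
  threshold = 50,     // autovacuum_analyze_threshold default
  scaleFactor = 0.2,  // autovacuum_analyze_scale_factor default
): number {
  return threshold + scaleFactor * liveTuples;
}
```

For the 10M-row table: the defaults give about 2,000,050 changed rows before stats refresh, while the per-table override (threshold 1,000, scale factor 0.01) drops that to about 101,000.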

-- Check autovacuum is running (not blocked)
SELECT
  pid,
  query,
  wait_event_type,
  wait_event,
  age(clock_timestamp(), query_start) AS duration
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
ORDER BY duration DESC;

-- Check tables that need ANALYZE most urgently
SELECT
  schemaname || '.' || relname AS "table",  -- quoted: table is a reserved word
  n_mod_since_analyze,
  last_autoanalyze,
  n_live_tup
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 10000
ORDER BY n_mod_since_analyze DESC
LIMIT 20;


Extended Statistics: Correlated Columns

PostgreSQL assumes columns are independent by default. If they're actually correlated (e.g., country and currency), estimates can be wildly off:

-- Problem: planner assumes country='US' AND currency='USD' are independent
-- Actual correlation: 95% of US rows have currency='USD'
-- Planner estimate: 0.3 × 0.6 = 0.18 selectivity (too low)
-- Actual selectivity: ~0.3 (matches country='US')
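Using the illustrative numbers above (selectivity 0.3 for country='US', 0.6 for currency='USD', and a 95% dependency of currency on country), the gap between the independence assumption and reality can be sketched as:

```typescript
// Default planner assumption: AND-ed predicates are independent,
// so combined selectivity is the product of the individual selectivities.
function independentSelectivity(selA: number, selB: number): number {
  return selA * selB;
}

// With a strong functional dependency (country largely determines currency),
// combined selectivity is close to the driving predicate's selectivity
// scaled by the dependency strength.
function dependentSelectivity(selDriver: number, dependencyFrac: number): number {
  return selDriver * dependencyFrac;
}
```

Here the independence assumption gives 0.18 versus an actual ~0.285, an underestimate of roughly a third, which on a large table is enough to flip a join strategy.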

-- Solution: extended statistics on correlated columns
CREATE STATISTICS stats_country_currency (dependencies)
  ON country, currency
  FROM orders;

ANALYZE orders;

-- Verify the statistics were collected
SELECT stxname, stxkeys, stxkind, stxddependencies
FROM pg_statistic_ext
JOIN pg_statistic_ext_data ON pg_statistic_ext.oid = pg_statistic_ext_data.stxoid
WHERE stxrelid = 'orders'::regclass;

-- After creating extended statistics, EXPLAIN should show better row estimates
EXPLAIN (ANALYZE)
SELECT * FROM orders
WHERE country = 'US' AND currency = 'USD';
-- Before: estimated X rows, actual Y rows (large discrepancy)
-- After: estimated X rows, actual Y rows (much closer)

-- Extended statistics types:
-- dependencies: column correlation (most common use case)
-- ndistinct:    distinct value estimates for column groups
-- mcv:          most common value combinations

CREATE STATISTICS stats_workspace_event (dependencies, ndistinct)
  ON workspace_id, name
  FROM events;

Diagnosing Bad Plans with EXPLAIN

-- Compare estimated vs actual rows — large discrepancy = stale/wrong stats
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT *
FROM events
WHERE workspace_id = '550e8400-e29b-41d4-a716-446655440000'
  AND name = 'project.created'
  AND occurred_at >= NOW() - INTERVAL '30 days';

-- Red flags in EXPLAIN output:
-- "rows=1 (actual rows=50000)" → stats very stale; ANALYZE needed
-- "Seq Scan" on large table with selective WHERE → wrong plan; may need ANALYZE + stats target increase
-- "Nested Loop" on large join → hash join would be better; row estimate likely wrong
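A rough rule of thumb when reading EXPLAIN ANALYZE output is the ratio between estimated and actual rows: an order-of-magnitude gap in either direction is almost always a statistics problem. A hypothetical helper:

```typescript
// Ratio between estimated and actual row counts from an EXPLAIN ANALYZE node.
// A ratio of ~10x or more usually means stale statistics, too low a
// statistics_target, or a missing extended statistic on correlated columns.
function misestimationRatio(estimatedRows: number, actualRows: number): number {
  const lo = Math.min(estimatedRows, actualRows);
  const hi = Math.max(estimatedRows, actualRows);
  return lo === 0 ? Number.POSITIVE_INFINITY : hi / lo;
}
```

The "rows=1 (actual rows=50000)" case above scores 50,000x; a healthy plan node stays in the low single digits.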

-- Force the planner to reconsider after ANALYZE:
-- SET enable_seqscan = off;  ← Only for debugging, not production
-- SET enable_nestloop = off; ← Ditto

-- Production-safe approach: check planner parameters
SHOW enable_hashjoin;     -- on
SHOW enable_indexscan;    -- on
SHOW random_page_cost;    -- 4.0 (default, suitable for spinning disks)
-- On SSD/NVMe: lower this to reduce planner bias toward sequential scans
SET random_page_cost = 1.1;  -- Globally: ALTER SYSTEM SET random_page_cost = 1.1; then SELECT pg_reload_conf();

SHOW effective_cache_size;  -- 4GB default; set to ~75% of RAM
-- ALTER SYSTEM SET effective_cache_size = '12GB';  -- For 16GB RAM server

Planner Configuration for AWS RDS / Aurora

-- Check current settings on RDS (can't edit postgresql.conf directly)
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN (
  'random_page_cost',
  'effective_cache_size',
  'work_mem',
  'maintenance_work_mem',
  'default_statistics_target',
  'autovacuum_analyze_scale_factor'
)
ORDER BY name;

-- Change via RDS Parameter Group (Terraform):
-- resource "aws_db_parameter_group" "postgres" {
--   parameter { name = "random_page_cost"     value = "1.1" }
--   parameter { name = "effective_cache_size" value = "1572864" }  # in 8kB pages = 12GB
--   parameter { name = "work_mem"              value = "65536"  }  # 64MB
--   parameter { name = "default_statistics_target" value = "200" }
-- }

-- Check if a parameter requires restart
SELECT name, pending_restart
FROM pg_settings
WHERE pending_restart = true;

TypeScript: Statistics Health Check

// lib/db/stats-health.ts — run periodically to catch stale statistics

import { prisma } from "@/lib/prisma";

interface TableStatHealth {
  table:              string;
  liveRows:           number;
  modsSinceAnalyze:   number;
  stalePct:           number;   // % of rows changed since last analyze
  lastAutoanalyze:    Date | null;
  needsAnalyze:       boolean;
}

export async function checkStatisticsHealth(): Promise<TableStatHealth[]> {
  const rows = await prisma.$queryRaw<{
    table_name:           string;
    n_live_tup:           bigint;
    n_mod_since_analyze:  bigint;
    last_autoanalyze:     Date | null;
  }[]>`
    SELECT
      relname AS table_name,
      n_live_tup,
      n_mod_since_analyze,
      last_autoanalyze
    FROM pg_stat_user_tables
    WHERE schemaname = 'public'
    ORDER BY n_mod_since_analyze DESC
    LIMIT 20
  `;

  return rows.map((row) => {
    const liveRows         = Number(row.n_live_tup);
    const modsSinceAnalyze = Number(row.n_mod_since_analyze);
    const stalePct         = liveRows > 0
      ? Math.round((modsSinceAnalyze / liveRows) * 100)
      : 0;

    return {
      table:            row.table_name,
      liveRows,
      modsSinceAnalyze,
      stalePct,
      lastAutoanalyze:  row.last_autoanalyze,
      needsAnalyze:     stalePct > 10 || modsSinceAnalyze > 100_000,
    };
  });
}

Cost and Timeline Estimates

Scope                                         Team      Timeline    Cost Range
Statistics audit (pg_stats + EXPLAIN review)  1 dev     Half a day  $150–300
Autovacuum tuning per-table                   1 dev     Half a day  $150–300
Extended statistics for correlated columns    1 dev     1 day       $300–600
Full query planner optimization engagement    1–2 devs  2–3 days    $800–1,500

Working With Viprasol

Stale statistics are the most common cause of mysteriously slow queries after a table grows: the planner's row estimates are wrong, so it picks a sequential scan instead of an index scan. Our team audits pg_stat_user_tables for tables with high n_mod_since_analyze, tunes autovacuum thresholds per-table for high-write workloads, sets per-column statistics_target for skewed distributions, and adds extended statistics for correlated column pairs.

What we deliver:

  • pg_stats query: n_distinct, histogram_bounds, null_frac per column
  • pg_stat_user_tables freshness check: n_mod_since_analyze + last_autoanalyze
  • Per-table autovacuum override: autovacuum_analyze_scale_factor = 0.01 for high-write tables
  • Per-column SET STATISTICS 500 for skewed distributions + re-ANALYZE
  • CREATE STATISTICS (dependencies) for correlated column pairs
  • RDS Parameter Group Terraform: random_page_cost = 1.1, effective_cache_size, work_mem
  • checkStatisticsHealth() TypeScript function: stalePct > 10% → needsAnalyze flag

Talk to our team about your PostgreSQL performance optimization →

Or explore our cloud and database services.

About the Author

Viprasol Tech Team

Custom Software Development Specialists

The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.

MT4/MT5 EA Development · AI Agent Systems · SaaS Development · Algorithmic Trading
