PostgreSQL Statistics and ANALYZE: pg_stats, Autovacuum Tuning, and Query Planner Hints
Understand and tune PostgreSQL query planner statistics. Covers pg_stats and pg_statistic, ANALYZE frequency, autovacuum tuning for high-write tables, default_statistics_target, extended statistics for correlated columns, and planner configuration parameters.
The PostgreSQL query planner makes decisions based on statistics — estimates of how many rows a condition will match, how values are distributed across a column, and whether columns are correlated. Stale or inaccurate statistics cause the planner to choose the wrong plan: a sequential scan instead of an index scan, or nested loop instead of hash join. Understanding how statistics work lets you fix bad plans without rewriting queries.
How PostgreSQL Uses Statistics
-- The planner reads from pg_stats (a view over pg_statistic)
SELECT
tablename,
attname AS column_name,
n_distinct, -- Estimated distinct values (-1 = all distinct; -0.5 = 50% of rows distinct)
null_frac, -- Fraction of NULLs
avg_width, -- Average column width in bytes
most_common_vals, -- Most frequent values (MCV list)
most_common_freqs, -- Frequency of each MCV
histogram_bounds -- Bucket boundaries for range queries
FROM pg_stats
WHERE tablename = 'events'
ORDER BY attname;
-- n_distinct interpretation:
-- n_distinct > 0: absolute estimate (e.g., 500 ≈ 500 distinct values)
-- n_distinct < 0: negated fraction of rows (e.g., -1 = all rows distinct, -0.5 = 50% of rows distinct)
-- n_distinct = 0: unknown (statistics not collected yet)
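To sanity-check an estimate against reality, compare it with a real distinct count. A minimal sketch using the events.workspace_id column from the examples in this article; note that COUNT(DISTINCT ...) scans the whole table, so save it for off-peak hours on large tables:
-- Compare the planner's estimate with the actual distinct count
-- (remember: a negative estimate is a fraction of rows, not a count)
SELECT s.n_distinct AS estimated,
  (SELECT COUNT(DISTINCT workspace_id) FROM events) AS actual
FROM pg_stats s
WHERE s.tablename = 'events' AND s.attname = 'workspace_id';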
-- Check if stats are fresh
SELECT
schemaname,
relname,
n_live_tup,
n_dead_tup,
last_analyze,
last_autoanalyze,
-- How stale are the stats? (rows changed since last analyze)
n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname IN ('events', 'users', 'projects')
ORDER BY n_mod_since_analyze DESC;
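A staleness percentage makes these numbers easier to rank across tables of different sizes; it is the same heuristic the TypeScript health check at the end of this article applies:
-- Rank tables by the share of rows modified since the last analyze
SELECT relname,
  n_live_tup,
  n_mod_since_analyze,
  round(100.0 * n_mod_since_analyze / GREATEST(n_live_tup, 1), 1) AS stale_pct
FROM pg_stat_user_tables
ORDER BY stale_pct DESC
LIMIT 10;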
Running ANALYZE
-- Analyze a single table (fast — doesn't block reads/writes)
ANALYZE events;
-- Analyze specific columns only (even faster)
ANALYZE events (workspace_id, name, occurred_at);
-- Analyze entire database (run after bulk imports)
ANALYZE VERBOSE;
-- Check how long ANALYZE takes (important for autovacuum tuning)
\timing
ANALYZE events;
-- Timing: 1200 ms (example)
-- If ANALYZE takes > 30 seconds: table is very large or statistics_target is high
-- Consider per-column statistics_target instead of global
-- Test a plan against fresh statistics without persisting them:
BEGIN;
ANALYZE events;
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
ROLLBACK; -- ANALYZE is transactional: rolling back discards the refreshed stats
statistics_target: Control Sample Size
-- Default is 100 (enough for most tables)
-- Higher = better estimates for skewed distributions, but slower ANALYZE
-- Check current global setting
SHOW default_statistics_target; -- 100
-- Increase for tables with highly skewed distributions
-- (e.g., events.name has 10 common values but 10,000 total)
ALTER TABLE events
ALTER COLUMN name SET STATISTICS 500; -- Per-column override
ALTER TABLE events
ALTER COLUMN workspace_id SET STATISTICS 200;
-- Very high-cardinality UUID columns: statistics don't help much
-- Keep at 100 or lower
ALTER TABLE events
ALTER COLUMN id SET STATISTICS 50;
-- After changing statistics_target: must re-ANALYZE
ANALYZE events;
-- Verify the change
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'events'::regclass
AND attstattarget <> -1 -- -1 = inherit from default
ORDER BY attname;
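To see what a higher target actually bought you, check how many MCV entries and histogram buckets were stored after the re-ANALYZE. pg_stats exposes these columns as anyarray, so the double cast below is a common workaround rather than an official API:
-- MCV list and histogram can grow up to the statistics target
SELECT attname,
  array_length(most_common_vals::text::text[], 1) AS mcv_entries,
  array_length(histogram_bounds::text::text[], 1) AS histogram_buckets
FROM pg_stats
WHERE tablename = 'events' AND attname IN ('name', 'workspace_id');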
Autovacuum Tuning for High-Write Tables
Autovacuum triggers ANALYZE automatically, but the default thresholds scale with table size, so a high-write table (like an events table with millions of inserts per day) can accumulate an enormous number of changes before its statistics refresh. These tables need more aggressive per-table settings:
-- Default autovacuum ANALYZE trigger:
-- autovacuum_analyze_threshold = 50 (rows)
-- autovacuum_analyze_scale_factor = 0.2 (20% of table)
-- Total: 50 + (0.2 × n_live_tup) changed rows
-- For a 10M-row events table: threshold = 50 + (0.2 × 10,000,000) = 2,000,050 rows
-- This means stats could be VERY stale on a busy table
-- Override autovacuum settings per-table (no restart required)
ALTER TABLE events SET (
autovacuum_analyze_scale_factor = 0.01, -- 1% instead of 20%
autovacuum_analyze_threshold = 1000, -- 1,000 rows instead of 50
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_vacuum_threshold = 1000
);
-- For a 10M-row table: threshold = 1,000 + (0.01 × 10,000,000) = 101,000 rows
-- Much more responsive to traffic patterns
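The overrides are stored as reloptions on the table; verify them after the ALTER:
-- Per-table autovacuum settings live in pg_class.reloptions
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'events';
-- {autovacuum_analyze_scale_factor=0.01,autovacuum_analyze_threshold=1000,...}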
-- Check autovacuum is running (not blocked)
SELECT
pid,
query,
wait_event_type,
wait_event,
age(clock_timestamp(), query_start) AS duration
FROM pg_stat_activity
WHERE query LIKE 'autovacuum%'
ORDER BY duration DESC;
-- Check tables that need ANALYZE most urgently
SELECT
schemaname || '.' || relname AS table_name,
n_mod_since_analyze,
last_autoanalyze,
n_live_tup
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 10000
ORDER BY n_mod_since_analyze DESC
LIMIT 20;
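If that list is long (for example after autovacuum was disabled during a migration), a one-off catch-up loop brings everything current. A sketch; ANALYZE takes only a SHARE UPDATE EXCLUSIVE lock, but on very large tables it is still worth scheduling off-peak:
-- Analyze every table flagged by the query above
DO $$
DECLARE
  t record;
BEGIN
  FOR t IN
    SELECT schemaname, relname
    FROM pg_stat_user_tables
    WHERE n_mod_since_analyze > 10000
  LOOP
    EXECUTE format('ANALYZE %I.%I', t.schemaname, t.relname);
    RAISE NOTICE 'analyzed %.%', t.schemaname, t.relname;
  END LOOP;
END $$;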
Extended Statistics: Correlated Columns
PostgreSQL assumes columns are independent by default. If they're actually correlated (e.g., country and currency), estimates can be wildly off:
-- Problem: planner assumes country='US' AND currency='USD' are independent
-- Actual correlation: 95% of US rows have currency='USD'
-- Planner estimate under independence: 0.3 × 0.6 = 0.18 selectivity (too low)
-- Actual selectivity: 0.3 × 0.95 ≈ 0.285 (almost every country='US' row also matches currency='USD')
-- Solution: extended statistics on correlated columns
CREATE STATISTICS stats_country_currency (dependencies)
ON country, currency
FROM orders;
ANALYZE orders;
-- Verify the statistics were collected
SELECT stxname, stxkeys, stxkind, stxddependencies
FROM pg_statistic_ext
JOIN pg_statistic_ext_data ON pg_statistic_ext.oid = pg_statistic_ext_data.stxoid
WHERE stxrelid = 'orders'::regclass;
-- After creating extended statistics, EXPLAIN should show better row estimates
EXPLAIN (ANALYZE)
SELECT * FROM orders
WHERE country = 'US' AND currency = 'USD';
-- Before: estimated X rows, actual Y rows (large discrepancy)
-- After: estimated X rows, actual Y rows (much closer)
-- Extended statistics types:
-- dependencies: column correlation (most common use case)
-- ndistinct: distinct value estimates for column groups
-- mcv: most common value combinations
CREATE STATISTICS stats_workspace_event (dependencies, ndistinct)
ON workspace_id, name
FROM events;
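The mcv flavour is worth adding when queries filter on specific value combinations; a sketch against the same orders example (pg_mcv_list_items requires PostgreSQL 12+ and read access to pg_statistic_ext_data):
CREATE STATISTICS stats_country_currency_mcv (mcv)
ON country, currency
FROM orders;
ANALYZE orders;
-- Inspect the stored combinations and their frequencies
SELECT m.*
FROM pg_statistic_ext s
JOIN pg_statistic_ext_data d ON s.oid = d.stxoid,
LATERAL pg_mcv_list_items(d.stxdmcv) m
WHERE s.stxname = 'stats_country_currency_mcv';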
Diagnosing Bad Plans with EXPLAIN
-- Compare estimated vs actual rows — large discrepancy = stale/wrong stats
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT *
FROM events
WHERE workspace_id = '550e8400-e29b-41d4-a716-446655440000'
AND name = 'project.created'
AND occurred_at >= NOW() - INTERVAL '30 days';
-- Red flags in EXPLAIN output:
-- "rows=1 (actual rows=50000)" → stats very stale; ANALYZE needed
-- "Seq Scan" on large table with selective WHERE → wrong plan; may need ANALYZE + stats target increase
-- "Nested Loop" on large join → hash join would be better; row estimate likely wrong
-- Force the planner to reconsider after ANALYZE:
-- SET enable_seqscan = off; ← Only for debugging, not production
-- SET enable_nestloop = off; ← Ditto
-- Production-safe approach: check planner parameters
SHOW enable_hashjoin; -- on
SHOW enable_indexscan; -- on
SHOW random_page_cost; -- 4.0 (default, suitable for spinning disks)
-- On SSD/NVMe: lower this to reduce planner bias toward sequential scans
SET random_page_cost = 1.1; -- Globally: ALTER SYSTEM SET random_page_cost = 1.1; then SELECT pg_reload_conf();
SHOW effective_cache_size; -- 4GB default; set to ~50-75% of RAM
-- ALTER SYSTEM SET effective_cache_size = '12GB'; -- For a 16GB RAM server
Planner Configuration for AWS RDS / Aurora
-- Check current settings on RDS (can't edit postgresql.conf directly)
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN (
'random_page_cost',
'effective_cache_size',
'work_mem',
'maintenance_work_mem',
'default_statistics_target',
'autovacuum_analyze_scale_factor'
)
ORDER BY name;
-- Change via RDS Parameter Group (Terraform):
-- resource "aws_db_parameter_group" "postgres" {
-- parameter { name = "random_page_cost" value = "1.1" }
-- parameter { name = "effective_cache_size" value = "1572864" } # unit is 8kB pages; 1572864 × 8kB = 12GB
-- parameter { name = "work_mem" value = "65536" } # 64MB
-- parameter { name = "default_statistics_target" value = "200" }
-- }
-- Check if a parameter requires restart
SELECT name, pending_restart
FROM pg_settings
WHERE pending_restart = true;
TypeScript: Statistics Health Check
// lib/db/stats-health.ts — run periodically to catch stale statistics
import { prisma } from "@/lib/prisma";
interface TableStatHealth {
table: string;
liveRows: number;
modsSinceAnalyze: number;
stalePct: number; // % of rows changed since last analyze
lastAutoanalyze: Date | null;
needsAnalyze: boolean;
}
export async function checkStatisticsHealth(): Promise<TableStatHealth[]> {
const rows = await prisma.$queryRaw<{
table_name: string;
n_live_tup: bigint;
n_mod_since_analyze: bigint;
last_autoanalyze: Date | null;
}[]>`
SELECT
relname AS table_name,
n_live_tup,
n_mod_since_analyze,
last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_mod_since_analyze DESC
LIMIT 20
`;
return rows.map((row) => {
const liveRows = Number(row.n_live_tup);
const modsSinceAnalyze = Number(row.n_mod_since_analyze);
const stalePct = liveRows > 0
? Math.round((modsSinceAnalyze / liveRows) * 100)
: 0;
return {
table: row.table_name,
liveRows,
modsSinceAnalyze,
stalePct,
lastAutoanalyze: row.last_autoanalyze,
needsAnalyze: stalePct > 10 || modsSinceAnalyze > 100_000,
};
});
}
Cost and Timeline Estimates
| Scope | Team | Timeline | Cost Range |
|---|---|---|---|
| Statistics audit (pg_stats + EXPLAIN review) | 1 dev | Half a day | $150–300 |
| Autovacuum tuning per-table | 1 dev | Half a day | $150–300 |
| Extended statistics for correlated columns | 1 dev | 1 day | $300–600 |
| Full query planner optimization engagement | 1–2 devs | 2–3 days | $800–1,500 |
See Also
- PostgreSQL Performance Tuning
- PostgreSQL EXPLAIN ANALYZE Deep Dive
- PostgreSQL Indexing Advanced Patterns
- PostgreSQL Partitioning Advanced Patterns
- AWS RDS Read Replicas
Working With Viprasol
Stale statistics are the most common cause of mysteriously slow queries after a table grows — the planner's row estimates are wrong so it picks a sequential scan instead of an index. Our team audits pg_stat_user_tables for tables with high n_mod_since_analyze, tunes autovacuum thresholds per-table for high-write workloads, sets per-column statistics_target for skewed distributions, and adds extended statistics for correlated column pairs.
What we deliver:
- pg_stats query: n_distinct, histogram_bounds, null_frac per column
- pg_stat_user_tables freshness check: n_mod_since_analyze + last_autoanalyze
- Per-table autovacuum override: autovacuum_analyze_scale_factor = 0.01 for high-write tables
- Per-column SET STATISTICS 500 for skewed distributions + re-ANALYZE
- CREATE STATISTICS (dependencies) for correlated column pairs
- RDS Parameter Group Terraform: random_page_cost = 1.1, effective_cache_size, work_mem
- checkStatisticsHealth() TypeScript function: stalePct > 10% → needsAnalyze flag
Talk to our team about your PostgreSQL performance optimization →
Or explore our cloud and database services.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.