# DuckDB Query Optimization: Speed Up Your Queries
DuckDB query optimization techniques: EXPLAIN ANALYZE, column pruning, predicate pushdown, indexes, and partition strategies. Make your queries faster.
DuckDB is already fast by default — its vectorized execution engine, parallel processing, and columnar storage handle most queries efficiently without tuning. But when you're dealing with large datasets or complex queries, understanding how to optimize makes a real difference.
Here's the systematic approach.
## Step 1: Understand What's Happening with EXPLAIN
Before optimizing, understand what DuckDB is actually doing:
```sql
-- See the query plan
EXPLAIN SELECT * FROM events WHERE user_id = 'user-123';

-- See the plan with actual execution statistics
EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 'user-123';
```

The output shows:
- Physical plan — what operations DuckDB will perform
- Estimated rows — how many rows DuckDB thinks each step processes
- Actual rows (with ANALYZE) — how many rows were actually processed
- Execution time — time spent in each operator
Look for:
- Large row counts at filter steps (means the filter is applied late)
- Full table scans where you expect an index scan
- Nested loop joins on large tables (usually should be hash joins)
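The exact layout varies by version, but the plan prints as a tree of operators with row counts and timings. An abbreviated, illustrative sketch (not verbatim DuckDB output) of the "filter applied late" pattern:

```text
FILTER (user_id = 'user-123')            42 rows   (0.00s)
└─ SEQ_SCAN events               10,000,000 rows   (0.85s)
```

Here the scan emits every row and the filter discards almost all of them; when pushdown works, the predicate appears inside the scan operator and the scan itself emits few rows.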
## Step 2: Column Pruning — Only Read What You Need
DuckDB's columnar storage means it only reads columns you reference. Don't SELECT * when you only need a few columns:
```sql
-- Slow: reads all columns from disk
SELECT * FROM events WHERE event_type = 'purchase';

-- Fast: reads only 3 columns
SELECT user_id, revenue, occurred_at FROM events WHERE event_type = 'purchase';
```

This matters most for wide tables (many columns) and for Parquet files, where columnar storage maximizes the benefit.
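Projection pushdown applies to file scans as well: when reading Parquet, DuckDB fetches only the column chunks for the columns you reference. A small sketch (the file name is hypothetical):

```sql
-- Reads only the user_id and revenue column chunks from the file,
-- leaving all other columns untouched
SELECT user_id, SUM(revenue) AS total_revenue
FROM read_parquet('events.parquet')
GROUP BY user_id;
```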
## Step 3: Predicate Pushdown — Filter Early
DuckDB automatically pushes filters down to the scan level, but you can help by:
1. Put the most selective filter first:

```sql
-- Less optimal: broader filter first
SELECT * FROM events
WHERE year = 2026 AND user_id = 'specific-user';

-- More optimal: most selective first (DuckDB may optimize automatically, but be explicit)
SELECT * FROM events
WHERE user_id = 'specific-user' AND year = 2026;
```

2. Use range predicates on sorted columns:

```sql
-- DuckDB can skip zones of data when filtering sorted timestamp columns
SELECT * FROM events
WHERE occurred_at BETWEEN '2026-03-01' AND '2026-03-31';
```

3. Push filters into CTEs and subqueries:

```sql
-- Less optimal: filter after join
SELECT e.*, u.segment
FROM events e
JOIN users u ON e.user_id = u.id
WHERE e.event_type = 'purchase' AND e.occurred_at > '2026-01-01';

-- More optimal: filter before join
WITH recent_purchases AS (
  SELECT * FROM events
  WHERE event_type = 'purchase' AND occurred_at > '2026-01-01'
)
SELECT rp.*, u.segment
FROM recent_purchases rp
JOIN users u ON rp.user_id = u.id;
```

## Step 4: Min/Max Zonemaps (Automatic)
DuckDB automatically creates min/max statistics (zonemaps) for each column in each row group. These let DuckDB skip entire chunks of data without reading them.
To benefit from zonemaps:

- Insert data in sorted order when possible (especially for timestamp columns)
- Avoid random inserts that scatter values across row groups

```sql
-- When loading data, sort by the most commonly filtered column
INSERT INTO events
SELECT * FROM staging_events
ORDER BY occurred_at;
```

If your data is pre-sorted by date (common for event logs), DuckDB can skip entire months of data when you filter by a date range.
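To see whether a load produced tight zonemaps, you can inspect per-row-group segment statistics; the exact columns returned vary by DuckDB version, but the stats field shows the min/max per segment:

```sql
-- Tight, non-overlapping occurred_at ranges across row groups
-- mean date filters can skip whole row groups
SELECT row_group_id, column_name, stats
FROM pragma_storage_info('events')
WHERE column_name = 'occurred_at';
```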
## Step 5: ART Indexes for Point Lookups
DuckDB supports Adaptive Radix Tree (ART) indexes for point lookups and range scans:
```sql
-- Create an index
CREATE INDEX idx_events_user_id ON events(user_id);
CREATE INDEX idx_events_occurred_at ON events(occurred_at);

-- Composite index
CREATE INDEX idx_events_user_type ON events(user_id, event_type);
```

When indexes help:

- Point lookups: `WHERE user_id = 'specific-user'`
- Highly selective range scans: `WHERE user_id = 'x' AND occurred_at > '2026-01-01'`
- JOIN conditions on the indexed column

When indexes don't help:

- Aggregations that scan the whole table anyway
- Queries that return more than ~10% of rows
- Low-cardinality columns like `event_type` with only a few distinct values

```sql
-- Check if an index exists
SELECT * FROM duckdb_indexes() WHERE table_name = 'events';
```

## Step 6: Join Optimization
DuckDB chooses join algorithms automatically, but you can influence them:
### Build Side Selection
For hash joins, DuckDB builds a hash table on the smaller relation. Its optimizer picks the join order and build side automatically, so manually reordering tables rarely changes performance, but writing joins in a consistent order still helps readability:

```sql
-- Both forms perform the same in practice (DuckDB re-orders joins);
-- pick the ordering that reads most clearly
SELECT * FROM events                      -- large
JOIN users ON events.user_id = users.id;  -- small

SELECT * FROM users                       -- small
JOIN events ON users.id = events.user_id; -- large
```

### Avoid Cross Joins

```sql
-- Accidental cross join (missing join condition)
SELECT * FROM orders, customers; -- Every order × every customer!

-- Correct
SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id;
```

### Use Appropriate Join Types
```sql
-- Inner join: only matching rows
SELECT * FROM orders INNER JOIN customers ON orders.customer_id = customers.id;

-- Left join: all orders, NULL for unmatched customers
SELECT * FROM orders LEFT JOIN customers ON orders.customer_id = customers.id;

-- Semi join (exists): use WHERE EXISTS instead of JOIN + DISTINCT
SELECT DISTINCT o.customer_id FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.segment = 'enterprise');
```

## Step 7: Aggregate Optimization
### Pre-aggregate When Joining
```sql
-- Less optimal: join then aggregate
SELECT u.segment, SUM(e.revenue)
FROM events e
JOIN users u ON e.user_id = u.id
WHERE e.event_type = 'purchase'
GROUP BY u.segment;

-- More optimal: aggregate before join (fewer rows to join)
WITH user_revenue AS (
  SELECT user_id, SUM(revenue) AS total_revenue
  FROM events
  WHERE event_type = 'purchase'
  GROUP BY user_id
)
SELECT u.segment, SUM(ur.total_revenue)
FROM user_revenue ur
JOIN users u ON ur.user_id = u.id
GROUP BY u.segment;
```

### GROUPING SETS for Multiple Aggregations
Instead of multiple queries:
```sql
-- Multiple queries (runs 3 table scans)
SELECT 'total' AS level, SUM(revenue) FROM events;
SELECT region, SUM(revenue) FROM events GROUP BY region;
SELECT region, product_id, SUM(revenue) FROM events GROUP BY region, product_id;

-- One query with GROUPING SETS (one table scan)
SELECT region, product_id, SUM(revenue)
FROM events
GROUP BY GROUPING SETS ((), (region), (region, product_id));
```

## Step 8: Memory Management
For large queries, tune memory settings:
```sql
-- Set memory limit (default: 80% of RAM)
SET memory_limit = '8GB';

-- Set temp directory for spilling to disk
SET temp_directory = '/fast-ssd/duckdb_temp';

-- Control parallel threads (default: number of CPU cores)
SET threads = 8;
```

Monitor memory usage:

```sql
SELECT memory_usage, memory_limit FROM pragma_database_size();
```

## Step 9: Partitioned Files
For Parquet data, use Hive-style partitioning to enable partition pruning:
```sql
-- Write partitioned Parquet
COPY (SELECT * FROM events) TO 'events_partitioned/'
(FORMAT PARQUET, PARTITION_BY (year, month));

-- Read with partition pruning (only reads year=2026/month=3/)
SELECT * FROM read_parquet('events_partitioned/**/*.parquet', hive_partitioning = true)
WHERE year = 2026 AND month = 3;
```

DuckDB skips the partitions that don't match the filter — no reads from other years or months.
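To confirm pruning is happening, you can run EXPLAIN on the filtered read; the scan should reflect the partition filter rather than listing every file (the exact plan text varies by DuckDB version):

```sql
EXPLAIN
SELECT count(*)
FROM read_parquet('events_partitioned/**/*.parquet', hive_partitioning = true)
WHERE year = 2026 AND month = 3;
```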
## DenchClaw Query Optimization
DenchClaw uses a PIVOT view pattern in DuckDB where EAV data is reshaped into readable views. For analytics queries, the key optimization is filtering before the PIVOT:
```sql
-- Faster: filter in the EAV layer before pivoting
WITH filtered_entries AS (
  SELECT ef.entry_id
  FROM entries e
  JOIN entry_fields ef ON e.id = ef.entry_id
  JOIN fields f ON ef.field_id = f.id
  WHERE f.name = 'Stage' AND ef.value = 'Proposal'
)
SELECT * FROM v_deals
WHERE id IN (SELECT entry_id FROM filtered_entries);
```

## Frequently Asked Questions
### How do I know if my query is using an index?

Use `EXPLAIN SELECT ...` and look for an INDEX_SCAN operator in the plan. If you see SEQ_SCAN (a full sequential scan) instead, the index is not being used.
### When should I add an index in DuckDB?
Add indexes for columns that appear in WHERE clauses for point lookups, JOIN conditions, and high-cardinality columns you filter frequently. Don't add indexes on columns you never filter on.
### Does DuckDB use parallel execution by default?
Yes. DuckDB automatically uses all available CPU cores for query execution.
### How can I make GROUP BY queries faster?

Pre-filter data before grouping and keep the number of distinct groups manageable. DuckDB already parallelizes aggregation and pre-aggregates per thread, but a GROUP BY that produces millions of groups is inherently expensive.
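As a sketch of pre-filtering, reusing the events table from the examples above:

```sql
-- Group only March events instead of the whole table
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE occurred_at >= '2026-03-01' AND occurred_at < '2026-04-01'
GROUP BY user_id;
```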
### What's the biggest performance gain available?

Replacing SELECT * with explicit column lists and adding date-range filters to large time-series tables typically give the biggest wins.
Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →
