# DuckDB Query Optimization: Speed Up Your Queries
DuckDB query optimization techniques: EXPLAIN ANALYZE, column pruning, predicate pushdown, indexes, and partition strategies. Make your queries faster.
DuckDB is already fast by default — its vectorized execution engine, parallel processing, and columnar storage handle most queries efficiently without tuning. But when you're dealing with large datasets or complex queries, understanding how to optimize makes a real difference.
Here's the systematic approach.
## Step 1: Understand What's Happening with EXPLAIN
Before optimizing, understand what DuckDB is actually doing:
```sql
-- See the query plan
EXPLAIN SELECT * FROM events WHERE user_id = 'user-123';

-- See the plan with actual execution statistics
EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 'user-123';
```

The output shows:
- Physical plan — what operations DuckDB will perform
- Estimated rows — how many rows DuckDB thinks each step processes
- Actual rows (with ANALYZE) — how many rows were actually processed
- Execution time — time spent in each operator
Look for:
- Large row counts at filter steps (means the filter is applied late)
- Full table scans where you expect an index scan
- Nested loop joins on large tables (usually should be hash joins)
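The exact layout varies by version, but the plan prints as a tree of operators with row counts and timings. An abbreviated, illustrative sketch (not verbatim DuckDB output) of the "filter applied late" pattern:

```text
FILTER (user_id = 'user-123')            42 rows   (0.00s)
└─ SEQ_SCAN events               10,000,000 rows   (0.85s)
```

Here the scan emits every row and the filter discards almost all of them; when pushdown works, the predicate appears inside the scan operator and the scan itself emits few rows.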
## Step 2: Column Pruning — Only Read What You Need
DuckDB's columnar storage means it only reads columns you reference. Don't SELECT * when you only need a few columns:
```sql
-- Slow: reads all columns from disk
SELECT * FROM events WHERE event_type = 'purchase';

-- Fast: reads only 3 columns
SELECT user_id, revenue, occurred_at FROM events WHERE event_type = 'purchase';
```

This matters most for wide tables (many columns) and for Parquet files, where columnar storage maximizes the benefit.
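Projection pushdown applies to file scans as well: when reading Parquet, DuckDB fetches only the column chunks for the columns you reference. A small sketch (the file name is hypothetical):

```sql
-- Reads only the user_id and revenue column chunks from the file,
-- leaving all other columns untouched
SELECT user_id, SUM(revenue) AS total_revenue
FROM read_parquet('events.parquet')
GROUP BY user_id;
```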
## Step 3: Predicate Pushdown — Filter Early
DuckDB automatically pushes filters down to the scan level, but you can help by:
1. Put the most selective filter first:

```sql
-- Less optimal: broader filter first
SELECT * FROM events
WHERE year = 2026 AND user_id = 'specific-user';

-- More optimal: most selective first (DuckDB may optimize automatically, but be explicit)
SELECT * FROM events
WHERE user_id = 'specific-user' AND year = 2026;
```

2. Use range predicates on sorted columns:

```sql
-- DuckDB can skip zones of data when filtering sorted timestamp columns
SELECT * FROM events
WHERE occurred_at BETWEEN '2026-03-01' AND '2026-03-31';
```

3. Push filters into CTEs and subqueries:

```sql
-- Less optimal: filter after join
SELECT e.*, u.segment
FROM events e
JOIN users u ON e.user_id = u.id
WHERE e.event_type = 'purchase' AND e.occurred_at > '2026-01-01';

-- More optimal: filter before join
WITH recent_purchases AS (
  SELECT * FROM events
  WHERE event_type = 'purchase' AND occurred_at > '2026-01-01'
)
SELECT rp.*, u.segment
FROM recent_purchases rp
JOIN users u ON rp.user_id = u.id;
```

## Step 4: Min/Max Zonemaps (Automatic)
DuckDB automatically creates min/max statistics (zonemaps) for each column in each row group. These let DuckDB skip entire chunks of data without reading them.
To benefit from zonemaps:

- Insert data in sorted order when possible (especially for timestamp columns)
- Avoid random inserts that scatter values across row groups

```sql
-- When loading data, sort by the most commonly filtered column
INSERT INTO events
SELECT * FROM staging_events
ORDER BY occurred_at;
```

If your data is pre-sorted by date (common for event logs), DuckDB can skip entire months of data when you filter by a date range.
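To see whether a load produced tight zonemaps, you can inspect per-row-group segment statistics; the exact columns returned vary by DuckDB version, but the stats field shows the min/max per segment:

```sql
-- Tight, non-overlapping occurred_at ranges across row groups
-- mean date filters can skip whole row groups
SELECT row_group_id, column_name, stats
FROM pragma_storage_info('events')
WHERE column_name = 'occurred_at';
```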
## Step 5: ART Indexes for Point Lookups
DuckDB supports Adaptive Radix Tree (ART) indexes for point lookups and range scans:
```sql
-- Create an index
CREATE INDEX idx_events_user_id ON events(user_id);
CREATE INDEX idx_events_occurred_at ON events(occurred_at);

-- Composite index
CREATE INDEX idx_events_user_type ON events(user_id, event_type);
```

When indexes help:

- Point lookups: `WHERE user_id = 'specific-user'`
- Highly selective range scans: `WHERE user_id = 'x' AND occurred_at > '2026-01-01'`
- JOIN conditions on the indexed column

When indexes don't help:

- Aggregations that scan the whole table anyway
- Queries that return more than ~10% of rows
- Low-cardinality columns like `event_type` with only a few distinct values

```sql
-- Check if an index exists
SELECT * FROM duckdb_indexes() WHERE table_name = 'events';
```

## Step 6: Join Optimization
DuckDB chooses join algorithms automatically, but you can influence them:
### Build Side Selection
For hash joins, DuckDB builds a hash table on the smaller relation. Its optimizer picks the join order and build side automatically, so manually reordering tables rarely changes performance, but writing joins in a consistent order still helps readability:

```sql
-- Both forms perform the same in practice (DuckDB re-orders joins);
-- pick the ordering that reads most clearly
SELECT * FROM events                      -- large
JOIN users ON events.user_id = users.id;  -- small

SELECT * FROM users                       -- small
JOIN events ON users.id = events.user_id; -- large
```

### Avoid Cross Joins

```sql
-- Accidental cross join (missing join condition)
SELECT * FROM orders, customers; -- Every order × every customer!

-- Correct
SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id;
```

### Use Appropriate Join Types
```sql
-- Inner join: only matching rows
SELECT * FROM orders INNER JOIN customers ON orders.customer_id = customers.id;

-- Left join: all orders, NULL for unmatched customers
SELECT * FROM orders LEFT JOIN customers ON orders.customer_id = customers.id;

-- Semi join (exists): use WHERE EXISTS instead of JOIN + DISTINCT
SELECT DISTINCT o.customer_id FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.segment = 'enterprise');
```

## Step 7: Aggregate Optimization
### Pre-aggregate When Joining
```sql
-- Less optimal: join then aggregate
SELECT u.segment, SUM(e.revenue)
FROM events e
JOIN users u ON e.user_id = u.id
WHERE e.event_type = 'purchase'
GROUP BY u.segment;

-- More optimal: aggregate before join (fewer rows to join)
WITH user_revenue AS (
  SELECT user_id, SUM(revenue) AS total_revenue
  FROM events
  WHERE event_type = 'purchase'
  GROUP BY user_id
)
SELECT u.segment, SUM(ur.total_revenue)
FROM user_revenue ur
JOIN users u ON ur.user_id = u.id
GROUP BY u.segment;
```

### GROUPING SETS for Multiple Aggregations
Instead of multiple queries:
```sql
-- Multiple queries (runs 3 table scans)
SELECT 'total' AS level, SUM(revenue) FROM events;
SELECT region, SUM(revenue) FROM events GROUP BY region;
SELECT region, product_id, SUM(revenue) FROM events GROUP BY region, product_id;

-- One query with GROUPING SETS (one table scan)
SELECT region, product_id, SUM(revenue)
FROM events
GROUP BY GROUPING SETS ((), (region), (region, product_id));
```

## Step 8: Memory Management
For large queries, tune memory settings:
```sql
-- Set memory limit (default: 80% of RAM)
SET memory_limit = '8GB';

-- Set temp directory for spilling to disk
SET temp_directory = '/fast-ssd/duckdb_temp';

-- Control parallel threads (default: number of CPU cores)
SET threads = 8;
```

Monitor memory usage:

```sql
SELECT memory_usage, memory_limit FROM pragma_database_size();
```

## Step 9: Partitioned Files
For Parquet data, use Hive-style partitioning to enable partition pruning:
```sql
-- Write partitioned Parquet
COPY (SELECT * FROM events) TO 'events_partitioned/'
(FORMAT PARQUET, PARTITION_BY (year, month));

-- Read with partition pruning (only reads year=2026/month=3/)
SELECT * FROM read_parquet('events_partitioned/**/*.parquet', hive_partitioning = true)
WHERE year = 2026 AND month = 3;
```

DuckDB skips the partitions that don't match the filter — no reads from other years or months.
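To confirm pruning is happening, you can run EXPLAIN on the filtered read; the scan should reflect the partition filter rather than listing every file (the exact plan text varies by DuckDB version):

```sql
EXPLAIN
SELECT count(*)
FROM read_parquet('events_partitioned/**/*.parquet', hive_partitioning = true)
WHERE year = 2026 AND month = 3;
```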
## DenchClaw Query Optimization
DenchClaw uses a PIVOT view pattern in DuckDB where EAV data is reshaped into readable views. For analytics queries, the key optimization is filtering before the PIVOT:
```sql
-- Faster: filter in the EAV layer before pivoting
WITH filtered_entries AS (
  SELECT ef.entry_id
  FROM entries e
  JOIN entry_fields ef ON e.id = ef.entry_id
  JOIN fields f ON ef.field_id = f.id
  WHERE f.name = 'Stage' AND ef.value = 'Proposal'
)
SELECT * FROM v_deals
WHERE id IN (SELECT entry_id FROM filtered_entries);
```

## Frequently Asked Questions
### How do I know if my query is using an index?

Use `EXPLAIN SELECT ...` and look for an INDEX_SCAN operator in the plan. If you see SEQ_SCAN (a full sequential scan) instead, the index is not being used.
### When should I add an index in DuckDB?
Add indexes for columns that appear in WHERE clauses for point lookups, JOIN conditions, and high-cardinality columns you filter frequently. Don't add indexes on columns you never filter on.
### Does DuckDB use parallel execution by default?
Yes. DuckDB automatically uses all available CPU cores for query execution.
### How can I make GROUP BY queries faster?

Pre-filter data before grouping and keep the number of distinct groups manageable. DuckDB already parallelizes aggregation and pre-aggregates per thread, but a GROUP BY that produces millions of groups is inherently expensive.
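As a sketch of pre-filtering, reusing the events table from the examples above:

```sql
-- Group only March events instead of the whole table
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE occurred_at >= '2026-03-01' AND occurred_at < '2026-04-01'
GROUP BY user_id;
```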
### What's the biggest performance gain available?

Replacing SELECT * with explicit column lists and adding date-range filters to large time-series tables typically give the biggest wins.
Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →
