The Ultimate Guide to DuckDB
DuckDB is one of the fastest embedded analytical databases available. This guide covers what DuckDB is, how it works, key use cases, and how to get started.
DuckDB is an in-process analytical database management system. It's fast, zero-dependency, and designed for the workloads that matter most in data analysis and AI applications. If you've heard it described as "SQLite for analytics," that's a fair starting point — but DuckDB is much more capable than that comparison implies.
What Is DuckDB?#
DuckDB is an open-source, in-process SQL OLAP (Online Analytical Processing) database. Key characteristics:
- In-process: Runs inside your application, no separate database server needed
- Embedded: Stores data in a single .duckdb file
- Columnar: Data is stored column-by-column, ideal for analytical queries
- SQL-rich: Supports an extensive SQL dialect including window functions, CTEs, and PIVOT
- Fast: Vectorized execution engine, typically 10-100x faster than SQLite for analytical queries
- Open source: MIT-licensed, actively maintained by DuckDB Labs
DuckDB was created at CWI Amsterdam (the same lab that created MonetDB) and first released in 2019. It reached 1.0 in 2024.
DuckDB vs. SQLite#
The most common comparison: DuckDB and SQLite are both embedded, single-file databases with no server required. They're designed for different workloads.
| Aspect | SQLite | DuckDB |
|---|---|---|
| Storage format | Row-oriented | Columnar |
| Best workload | Transactional (OLTP) | Analytical (OLAP) |
| Concurrent writes | Limited | Limited |
| Analytical query speed | Adequate | Excellent (10-100x faster) |
| Window functions | Limited | Full support |
| PIVOT/UNPIVOT | Manual | Native |
| JSON handling | Basic | Advanced |
| Parquet support | Via extension | Native |
| Memory use | Very low | Moderate |
| Maturity | Decades | Since 2019 |
Choose SQLite when: Your workload is primarily transactional (inserts/updates/single-row reads). You need the most mature embedded database.
Choose DuckDB when: Your workload involves analytics, aggregations, complex joins, or reading wide datasets. You need full SQL including window functions and PIVOT.
DenchClaw uses DuckDB because its CRM workload is predominantly analytical — pipeline reports, contact search, activity aggregations, funnel analysis.
Core DuckDB Concepts#
Column Store#
DuckDB stores data in columns rather than rows. For analytical queries that read few columns across many rows, this is dramatically more efficient.
If you have a table with 20 columns and 100,000 rows, and you query 3 of those columns:
- Row store: reads all 20 columns × 100,000 rows
- Column store: reads only 3 columns × 100,000 rows
For DenchClaw's CRM with many fields per contact, columnar storage means queries that filter on 2-3 fields read 10-20% of the data a row-store would read.
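The arithmetic above can be sketched in a few lines of Python. The byte counts are illustrative assumptions (fixed-width 8-byte values), not DuckDB internals:

```python
# Sketch: why columnar storage wins for narrow analytical queries.
# Numbers are illustrative assumptions, not DuckDB internals.

N_ROWS = 100_000
N_COLS = 20
BYTES_PER_VALUE = 8  # assume fixed-width 8-byte values for simplicity

def row_store_bytes_scanned(cols_needed: int) -> int:
    # A row store reads whole rows, so every column comes along.
    return N_ROWS * N_COLS * BYTES_PER_VALUE

def column_store_bytes_scanned(cols_needed: int) -> int:
    # A column store reads only the columns the query touches.
    return N_ROWS * cols_needed * BYTES_PER_VALUE

row_bytes = row_store_bytes_scanned(3)
col_bytes = column_store_bytes_scanned(3)
print(f"row store:    {row_bytes:>12,} bytes")
print(f"column store: {col_bytes:>12,} bytes")
print(f"column store reads {col_bytes / row_bytes:.0%} of the data")  # 15%
```

Querying 3 of 20 columns touches 15% of the bytes a row store would scan; compression within each column (similar values stored together) widens the gap further.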
Vectorized Execution#
DuckDB processes data in batches (vectors) of 2,048 values at a time, with operators written so the CPU can apply SIMD instructions across each batch. This makes even simple operations significantly faster than row-by-row processing.
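A minimal sketch of the batch-at-a-time idea in plain Python — this illustrates the structure only, not SIMD itself or DuckDB's C++ engine:

```python
# Sketch of vectorized execution: operate on a whole batch (vector) of
# values at a time instead of one row at a time. Real engines add SIMD;
# the batch structure alone already cuts per-row dispatch overhead.

VECTOR_SIZE = 2048  # DuckDB's vectors are similarly sized by default

def scan_in_vectors(values):
    """Yield the input in fixed-size batches, as a vectorized engine would."""
    for start in range(0, len(values), VECTOR_SIZE):
        yield values[start:start + VECTOR_SIZE]

def vectorized_sum(values):
    # The inner sum() runs over an entire vector in one call, rather than
    # dispatching an interpreter step per row.
    return sum(sum(batch) for batch in scan_in_vectors(values))

data = list(range(10_000))
assert vectorized_sum(data) == sum(data)
```

Each operator in the query plan consumes and produces vectors like these, which keeps hot loops tight and cache-friendly.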
SQL Completeness#
DuckDB supports a far richer SQL dialect than SQLite. Key features beyond what SQLite provides:
Window functions:
```sql
SELECT
    rep_name,
    deal_value,
    SUM(deal_value) OVER (PARTITION BY rep_name) AS rep_total,
    RANK() OVER (PARTITION BY rep_name ORDER BY deal_value DESC) AS rank_in_rep
FROM deals;
```

Native PIVOT:
```sql
PIVOT deals
ON stage
USING SUM(value) AS total_value, COUNT(*) AS deal_count
GROUP BY quarter;
```

Recursive CTEs:
```sql
WITH RECURSIVE org_hierarchy AS (
    SELECT id, manager_id, name, 0 AS level FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.manager_id, e.name, h.level + 1
    FROM employees e JOIN org_hierarchy h ON e.manager_id = h.id
)
SELECT * FROM org_hierarchy;
```

Direct File Reading#
DuckDB can read external files directly without importing:
```sql
-- Read a CSV without importing
SELECT * FROM read_csv('contacts.csv');

-- Read a Parquet file
SELECT COUNT(*) FROM 'data.parquet';

-- Read all Parquet files in a directory
SELECT * FROM 'data/*.parquet';
```

This makes DuckDB excellent for data analysis pipelines where you want to query files directly.
Getting Started with DuckDB#
Installation#
```shell
# macOS
brew install duckdb

# Python
pip install duckdb

# Node.js
npm install duckdb

# Download CLI
# https://duckdb.org/docs/installation/
```

Basic Usage (CLI)#
```shell
# Start the DuckDB CLI
duckdb mydata.duckdb
```

```sql
-- Create a table
CREATE TABLE contacts (
    id INTEGER PRIMARY KEY,
    name VARCHAR,
    email VARCHAR,
    company VARCHAR,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Insert data
INSERT INTO contacts VALUES (1, 'Sarah Chen', 'sarah@stripe.com', 'Stripe', DEFAULT);

-- Query
SELECT * FROM contacts WHERE company = 'Stripe';
```

Python API#
```python
import duckdb

# Connect to a file (or ':memory:' for in-memory)
conn = duckdb.connect('mydata.duckdb')

# Create a table from a CSV
conn.execute("CREATE TABLE contacts AS SELECT * FROM read_csv('contacts.csv')")

# Analytical query
result = conn.execute("""
    SELECT c.company, COUNT(*) AS contacts, AVG(deal_value) AS avg_deal
    FROM contacts c JOIN deals d ON c.company = d.company
    GROUP BY c.company
    ORDER BY avg_deal DESC
""").fetchdf()  # Returns a Pandas DataFrame
print(result)
```

Node.js API#
```javascript
import duckdb from 'duckdb';

const db = new duckdb.Database(':memory:');
const conn = db.connect();

conn.exec(`
  CREATE TABLE contacts AS SELECT * FROM read_csv('/path/to/contacts.csv');
`);

conn.all(`
  SELECT company, COUNT(*) AS count FROM contacts GROUP BY company ORDER BY count DESC
`, (err, result) => {
  console.log(result);
});
```

DenchClaw uses DuckDB via Node.js for all CRM operations.
DuckDB for CRM Applications#
DenchClaw's use of DuckDB demonstrates its fitness for CRM workloads.
The EAV + PIVOT Pattern#
DenchClaw uses Entity-Attribute-Value storage with PIVOT views. This allows custom fields without schema migrations:
```sql
-- EAV storage
CREATE TABLE entry_fields (
    entry_id INTEGER,
    field_name VARCHAR,
    field_value VARCHAR
);

-- PIVOT view: flat access
CREATE VIEW v_people AS
PIVOT entry_fields
ON field_name IN ('Full Name', 'Email', 'Company', 'Status', 'Phone')
USING FIRST(field_value)
GROUP BY entry_id;
```

A query like this would be painful in SQLite; in DuckDB, PIVOT is native and fast.
Full-Text Search#
DuckDB's FTS extension enables fast full-text search:
```sql
INSTALL fts;
LOAD fts;

PRAGMA create_fts_index('v_people', 'entry_id', 'Full Name', 'Company');

SELECT * FROM v_people
WHERE fts_main_v_people.match_bm25(entry_id, 'stripe engineer') IS NOT NULL;
```

Analytical Pipeline Queries#
```sql
-- Pipeline by stage with conversion
SELECT
    stage,
    COUNT(*) AS deal_count,
    SUM(CAST(value AS DECIMAL)) AS total_value,
    AVG(EXTRACT(day FROM NOW() - created_at)) AS avg_days_in_stage
FROM v_deals
WHERE status = 'Active'
GROUP BY stage
ORDER BY total_value DESC;
```

This returns in 2-3ms on a typical DenchClaw dataset. A similar query in a cloud CRM API takes 200-500ms.
DuckDB Extensions#
DuckDB has a rich extension ecosystem:
| Extension | Purpose |
|---|---|
| httpfs | Read files from HTTP/S3/GCS |
| fts | Full-text search |
| parquet | Parquet file support (built-in) |
| json | JSON processing (built-in) |
| spatial | Geospatial operations |
| excel | Read Excel files |
| sqlite | Read SQLite databases |
| postgres | Connect to PostgreSQL |

```sql
INSTALL httpfs;
LOAD httpfs;

-- Query a Parquet file directly from S3
SELECT * FROM 's3://my-bucket/data.parquet' LIMIT 10;
```

DuckDB in Production#
When DuckDB Is Right for Production#
DuckDB is appropriate for production when:
- Workload is primarily analytical (read-heavy)
- Single-writer, multi-reader pattern
- Local or single-machine deployment
- Dataset fits comfortably on disk
DenchClaw is a production DuckDB deployment — a personal CRM with 10,000-100,000 records running on your laptop.
Limitations#
DuckDB is not appropriate for:
- High-concurrency multi-writer workloads
- Distributed systems requiring replication
- Very large datasets (multi-terabyte) on memory-constrained hardware
- Applications requiring strong ACID guarantees across processes
For these use cases, PostgreSQL or ClickHouse are more appropriate.
Backup Strategies#
```shell
# Simple file copy (works when DuckDB is not connected)
cp workspace.duckdb workspace.duckdb.backup
```

```sql
-- Export to Parquet (while connected)
EXPORT DATABASE 'backup_dir/' (FORMAT PARQUET);

-- Export specific tables
COPY (SELECT * FROM v_people) TO 'people_backup.parquet';
```

Frequently Asked Questions#
How fast is DuckDB really?#
On analytical queries (GROUP BY, aggregations, full-table scans), DuckDB is typically 10-100x faster than SQLite. On transactional workloads (single-row inserts/updates), SQLite is faster. DuckDB's speed for analytics is what makes DenchClaw's instant CRM queries possible.
Can DuckDB handle millions of records?#
Yes. DuckDB handles hundreds of millions of rows well on modern hardware. For a CRM use case (even enterprise scale), DuckDB is more than sufficient.
Is DuckDB production-stable?#
Yes, since the 1.0 release in 2024. It's used in production by major data platforms and analytics companies.
What's the difference between DuckDB and ClickHouse?#
Both are columnar analytical databases. ClickHouse is a distributed server-based system designed for petabyte-scale analytics. DuckDB is embedded and single-machine. For personal and small-team use cases, DuckDB is simpler and more appropriate. ClickHouse scales further but requires infrastructure.
Can I use DuckDB with Pandas?#
Yes. DuckDB integrates tightly with Pandas: `conn.execute("SELECT * FROM df").fetchdf()` converts a DuckDB result to a Pandas DataFrame, and `conn.register("df", pandas_df)` registers a DataFrame as a queryable table.
Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →
