
DuckDB Bulk Import: Loading Large Datasets Fast

How to bulk import data into DuckDB from CSV, Parquet, JSON, Excel, and external databases: the fastest patterns for loading large datasets without bogging down your analysis.

Mark Rachapoom · 6 min read

DuckDB has excellent bulk import performance — it can load a 1GB CSV file in seconds. But the method you use matters. This guide covers every import format and the fastest approach for each.

The Fastest Pattern: Read Directly

For CSV and Parquet, you often don't need to import at all. DuckDB can query files directly:

-- Query without importing
SELECT COUNT(*), SUM(revenue) FROM 'events.csv';
SELECT * FROM 'events.parquet';
SELECT * FROM read_parquet('events/*.parquet');

Only import when you need to:

  1. Query the same data many times (importing avoids re-parsing the file)
  2. Do complex multi-table joins (imported tables are faster)
  3. Apply transformations and store results

CSV Import

Basic Import

-- Auto-detect schema and import
CREATE TABLE events AS SELECT * FROM read_csv_auto('events.csv');
 
-- Or load into existing table
COPY events FROM 'events.csv' (HEADER TRUE, DELIMITER ',');

With Schema Control

-- Explicit schema
CREATE TABLE events (
    event_id VARCHAR,
    user_id VARCHAR,
    event_type VARCHAR,
    revenue DECIMAL(10,2),
    occurred_at TIMESTAMP
);
 
-- Import with type specifications
COPY events FROM 'events.csv' (
    HEADER TRUE,
    DELIMITER ',',
    NULLSTR 'NULL',
    DATEFORMAT '%Y-%m-%d',
    TIMESTAMPFORMAT '%Y-%m-%d %H:%M:%S'
);

CSV Options Reference

SELECT * FROM read_csv_auto('data.csv',
    header = true,           -- First row is headers
    delimiter = ',',         -- Field separator
    quote = '"',             -- Quote character
    escape = '"',            -- Escape character
    nullstr = '',            -- What represents NULL
    skip = 0,                -- Rows to skip at start
    max_line_size = 2097152, -- Max line size in bytes
    compression = 'auto',    -- 'auto', 'none', 'gzip', 'zstd'
    types = {                -- Override types for specific columns
        'timestamp': 'TIMESTAMP',
        'amount': 'DECIMAL(10,2)'
    }
);

Multi-File CSV Import

-- Import all CSVs from a directory
CREATE TABLE events AS 
SELECT * FROM read_csv_auto('data/events_*.csv');
 
-- Wildcards work across directories too
SELECT * FROM read_csv_auto('data/**/*.csv');

Handling Malformed CSV

-- Skip rows that can't be parsed
SELECT * FROM read_csv_auto('messy.csv', 
    ignore_errors = true,
    max_line_size = 5242880  -- 5MB line limit for very long rows
);

Parquet Import

Parquet is DuckDB's native format and the fastest to load:

-- Direct query (no import needed)
SELECT * FROM 'data.parquet';
 
-- Import into table
CREATE TABLE events AS SELECT * FROM 'events.parquet';
 
-- Multiple files
CREATE TABLE events AS 
SELECT * FROM read_parquet(['jan.parquet', 'feb.parquet', 'mar.parquet']);
 
-- With Hive partitioning
-- With Hive partitioning (partition keys like year/month become columns)
SELECT *
FROM read_parquet('events_partitioned/**/*.parquet', hive_partitioning = true)
WHERE year = 2026;

JSON Import

-- Auto-detect JSON structure
SELECT * FROM read_json_auto('data.json');
 
-- Newline-delimited JSON (NDJSON/JSONL)
SELECT * FROM read_ndjson_auto('events.ndjson');
 
-- Explicit schema
SELECT * FROM read_json('data.json', 
    columns = {
        'id': 'VARCHAR',
        'name': 'VARCHAR',
        'created_at': 'TIMESTAMP',
        'metadata': 'JSON'
    }
);
 
-- Nested JSON — keep each record as a raw JSON value and extract by path
SELECT 
    json_extract_string(json, '$.user.id') AS user_id,
    json_extract_string(json, '$.event.type') AS event_type,
    CAST(json_extract(json, '$.revenue') AS DECIMAL(10,2)) AS revenue
FROM read_json_objects('events.json');

Excel Import

INSTALL excel;
LOAD excel;
 
-- Import from Excel
SELECT * FROM read_xlsx('contacts.xlsx');
 
-- Specific sheet
SELECT * FROM read_xlsx('contacts.xlsx', sheet = 'Sheet2');
 
-- With header row
SELECT * FROM read_xlsx('contacts.xlsx', header = true);

Database Import

From PostgreSQL

INSTALL postgres;
LOAD postgres;
 
ATTACH 'host=localhost dbname=production user=app password=secret' AS prod (TYPE POSTGRES);
 
-- Import a table
CREATE TABLE customers AS SELECT * FROM prod.customers;
 
-- Import with transformation
CREATE TABLE active_customers AS 
SELECT id, name, email, created_at
FROM prod.customers
WHERE status = 'active' AND created_at > '2025-01-01';

From SQLite

INSTALL sqlite;
LOAD sqlite;
 
ATTACH 'legacy.db' AS sqlite_db (TYPE SQLITE);
CREATE TABLE contacts AS SELECT * FROM sqlite_db.contacts;

From MySQL

INSTALL mysql;
LOAD mysql;
 
ATTACH 'host=localhost user=root database=crm' AS mysql_db (TYPE MYSQL);
CREATE TABLE deals AS SELECT * FROM mysql_db.deals;

S3 and Cloud Import

INSTALL httpfs;
LOAD httpfs;
 
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_KEY';
SET s3_secret_access_key = 'YOUR_SECRET';
 
-- Import from S3
CREATE TABLE events AS 
SELECT * FROM read_parquet('s3://my-bucket/events/2026/*.parquet');
 
-- Import from HTTP
CREATE TABLE reference_data AS 
SELECT * FROM read_csv_auto('https://example.com/data.csv');

Performance Optimization for Bulk Import

Insert Mode

For large imports, use COPY instead of INSERT:

-- Fast: COPY
COPY events FROM 'events.csv' (HEADER TRUE);
 
-- Slower: INSERT INTO ... SELECT
INSERT INTO events SELECT * FROM read_csv_auto('events.csv');

COPY uses bulk loading optimizations; INSERT INTO ... SELECT goes through the normal insert path.

Transaction Batching

For programmatic inserts, batch them:

import duckdb
 
con = duckdb.connect('db.duckdb')
 
# Fast: single executemany
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [(r['id'], r['type'], r['value'], r['ts']) for r in records]
)
 
# Slow: individual inserts
for r in records:
    con.execute("INSERT INTO events VALUES (?, ?, ?, ?)", 
                [r['id'], r['type'], r['value'], r['ts']])

Parallel Import

For very large datasets, split files and let DuckDB parallelize:

-- DuckDB automatically parallelizes multi-file reads
CREATE TABLE events AS 
SELECT * FROM read_parquet('events_shard_*.parquet');

DuckDB reads multiple files in parallel using all available CPU cores.

Import Then Index

Create indexes after importing, not before:

-- Import first (no indexes)
CREATE TABLE events AS SELECT * FROM 'events.parquet';
 
-- Then create indexes (faster to build on full data)
CREATE INDEX idx_events_user ON events(user_id);
CREATE INDEX idx_events_time ON events(occurred_at);

Importing CRM Data into DenchClaw#

DenchClaw can import data from HubSpot, Salesforce, and other CRMs using its browser agent. But you can also import directly into the DuckDB database using these patterns:

import duckdb
import uuid
import pandas as pd
 
# Load contacts from a CSV export
contacts_df = pd.read_csv('hubspot_contacts.csv')
 
# Connect to DenchClaw's database
con = duckdb.connect('/Users/you/.openclaw-dench/workspace/workspace.duckdb')
 
# Map to DenchClaw's EAV schema
# (In practice, use DenchClaw's import workflow which handles this automatically)
for _, row in contacts_df.iterrows():
    entry_id = str(uuid.uuid4())
    con.execute("INSERT INTO entries VALUES (?, ?, NOW(), NOW())", 
                [entry_id, 'contact_object_id'])
    con.execute("INSERT INTO entry_fields VALUES (?, ?, ?)",
                [entry_id, 'name_field_id', row['name']])

DenchClaw's built-in import handles this mapping automatically via the browser agent or the CRM skill.

Frequently Asked Questions

How fast can DuckDB import a 1GB CSV?

Typically 5-15 seconds on modern hardware, depending on the number of columns and CPU cores available.

Can I import compressed files directly?

Yes. DuckDB auto-detects gzip and zstd compression from the file extension: SELECT * FROM 'events.csv.gz'.

What's the largest file DuckDB can import?

There's no hard limit. DuckDB streams data during import and can handle files larger than RAM by spilling to disk.

How do I handle CSV files with inconsistent schemas?

Use read_csv_auto with ignore_errors=true to skip malformed rows, or use VARCHAR for all columns and cast manually.

Can I import data from an API?

Yes. With the httpfs extension installed and loaded, read_json_auto('https://api.example.com/data') works if the API returns JSON; otherwise, download the response to a file first and import that.

Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →

Written by Mark Rachapoom, building the future of AI CRM software.


© 2026 DenchHQ · San Francisco, CA