# DuckDB with Apache Arrow: Zero-Copy Data Processing
DuckDB and Apache Arrow integrate natively for zero-copy data exchange. Here's how to use them together for high-performance data pipelines.
DuckDB and Apache Arrow are designed to work together. Arrow provides an in-memory columnar format; DuckDB provides fast analytical SQL. When you combine them, you get zero-copy data exchange — DuckDB can read Arrow tables without deserializing or copying data.
This matters for performance-critical pipelines where moving data between tools is usually the bottleneck.
## What Is Apache Arrow?
Apache Arrow is a language-independent columnar memory format. It defines how columnar data is laid out in memory so that:
- Different tools can share data without serialization/deserialization
- SIMD CPU instructions can operate directly on Arrow-format data
- Data can be shared across Python, Java, C++, and Rust processes with zero copies
It's the standard format used by Pandas 2.0+, Polars, PyArrow, Spark, and now DuckDB.
## The DuckDB + Arrow Integration
DuckDB can:
- Read from Arrow tables without copying data
- Write to Arrow tables without copying data
- Execute SQL against Arrow data via the `arrow_scan` function
- Return results as Arrow via the Python API
This means: compute in DuckDB, pass results to Pandas or Polars as Arrow, and avoid the copy.
## Python Setup

```shell
pip install duckdb pyarrow pandas
```

```python
import duckdb
import pyarrow as pa
import pyarrow.compute as pc
```

## Reading Arrow Tables in DuckDB
```python
import duckdb
import pyarrow as pa

# Create an Arrow table
tbl = pa.table({
    'user_id': ['u1', 'u2', 'u3', 'u4'],
    'revenue': [100.0, 250.0, 75.0, 500.0],
    'segment': ['pro', 'free', 'pro', 'enterprise']
})

# Query it with DuckDB SQL — zero copy!
result = duckdb.sql("""
    SELECT segment, SUM(revenue) AS total, COUNT(*) AS users
    FROM tbl
    GROUP BY segment
    ORDER BY total DESC
""").arrow()  # Returns an Arrow table

print(result.to_pandas())
```

The `FROM tbl` clause references the Python variable directly (a DuckDB replacement scan). DuckDB reads the Arrow buffers without copying them. Note that the variable is named `tbl` rather than `table`, since `TABLE` is a SQL keyword.
## Writing DuckDB Results as Arrow

```python
import duckdb

con = duckdb.connect('analytics.duckdb')

# Execute a query and get the result as Arrow
arrow_result = con.execute("""
    SELECT
        user_id,
        SUM(revenue) AS lifetime_value,
        COUNT(*) AS purchase_count
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY user_id
""").arrow()

# arrow_result is a pyarrow.Table
print(type(arrow_result))  # <class 'pyarrow.lib.Table'>
print(arrow_result.schema)
```

There is no serialization to bytes and no deserialization on the other side: DuckDB materializes the result directly as Arrow buffers.
## DuckDB + Arrow + Pandas Pipeline

```python
import duckdb
import pandas as pd

# Step 1: Load data with Pandas (existing workflow)
df = pd.read_parquet('events.parquet')

# Step 2: Heavy aggregation in DuckDB (fast!)
# DuckDB reads the Pandas DataFrame via Arrow
agg = duckdb.sql("""
    SELECT
        DATE_TRUNC('month', occurred_at) AS month,
        segment,
        SUM(revenue) AS revenue,
        COUNT(DISTINCT user_id) AS users
    FROM df
    GROUP BY month, segment
    ORDER BY month, revenue DESC
""").df()  # Returns a Pandas DataFrame

# Step 3: Visualization in Pandas/matplotlib
agg.pivot(index='month', columns='segment', values='revenue').plot(kind='bar')
```

The `.df()` method returns a Pandas DataFrame, `.arrow()` returns an Arrow table, and `.fetchnumpy()` returns NumPy arrays.
## Arrow Flight SQL

For distributed systems, Arrow Flight SQL is a high-performance RPC protocol for transferring Arrow data between services, and DuckDB can participate in Flight SQL pipelines through its ADBC support:

```python
# This is more advanced — for distributed data pipelines
# where DuckDB acts as a query engine over Arrow streams
import pyarrow.flight as flight

# DuckDB can serve as a Flight endpoint
# (requires additional setup with ADBC)
```

## DuckDB + Polars via Arrow
Polars is natively Arrow-based, making the DuckDB integration seamless:

```python
import duckdb
import polars as pl

# Create a Polars DataFrame
df = pl.read_parquet('events.parquet')

# Query it with DuckDB
result = duckdb.sql("""
    SELECT
        user_id,
        SUM(value) AS total
    FROM df
    GROUP BY user_id
""").pl()  # Returns a Polars DataFrame

print(result.head())
```

The `.pl()` method returns a Polars DataFrame via Arrow — zero-copy from DuckDB to Polars.
## DuckDB + Ray via Arrow

For distributed machine learning pipelines:

```python
import ray
import duckdb
import pyarrow as pa

@ray.remote
def analyze_partition(parquet_path: str) -> pa.Table:
    con = duckdb.connect()
    return con.execute(f"""
        SELECT user_id, SUM(revenue) AS ltv
        FROM read_parquet('{parquet_path}')
        GROUP BY user_id
    """).arrow()

# Distribute across a Ray cluster
results = ray.get([
    analyze_partition.remote(f's3://bucket/shard_{i}.parquet')
    for i in range(10)
])

# Combine the Arrow tables
combined = pa.concat_tables(results)
```

## Arrow Dataset Integration
PyArrow Datasets provide lazy loading — DuckDB can query them with predicate pushdown:

```python
import pyarrow.dataset as ds
import duckdb

# Create a Dataset (lazy — doesn't load data yet)
dataset = ds.dataset('s3://my-bucket/events/', format='parquet')

# DuckDB queries the dataset with pushdown
result = duckdb.sql("""
    SELECT
        user_id,
        SUM(revenue) AS total
    FROM dataset
    WHERE occurred_at > '2026-01-01'
    GROUP BY user_id
""").arrow()
```

DuckDB pushes the `WHERE occurred_at > '2026-01-01'` filter down to the Parquet reads, so only the relevant row groups are scanned.
## Performance Benchmarks
On a 10M row dataset:
| Operation | Approach | Time |
|---|---|---|
| Pandas groupby | Pure Pandas | 8.2s |
| DuckDB on Pandas (copy) | con.register() | 0.6s |
| DuckDB on Arrow (zero-copy) | Direct Arrow read | 0.4s |
| DuckDB on Parquet | read_parquet() | 0.5s |
The zero-copy Arrow path is consistently the fastest for large DataFrames because it avoids both the Pandas overhead and the copy cost.
## DenchClaw and Arrow

DenchClaw uses DuckDB natively, and the App Builder supports returning query results as Arrow tables for use in analytical apps:

```javascript
// In a DenchClaw app — use with the Apache Arrow JS library
import { tableFromIPC } from 'apache-arrow';

const arrowData = await dench.db.queryArrow(`
  SELECT "Stage", SUM("Deal Value") AS pipeline_value
  FROM v_deals
  GROUP BY "Stage"
`);

const table = tableFromIPC(arrowData);
```

This is particularly useful for building high-performance data visualizations where you need to process millions of data points in the browser.
## Frequently Asked Questions
### Does DuckDB's Arrow integration work with R?

Yes. The `duckdb` R package supports Arrow via the `arrow` R package. Use `to_arrow()` to convert DuckDB results to Arrow.
### What's the difference between .df() and .arrow() in Python?

`.df()` returns a Pandas DataFrame; the conversion from Arrow can copy data for some dtypes. `.arrow()` returns a PyArrow Table (zero-copy columnar). Use `.arrow()` when performance matters or when passing to other Arrow-native tools.
### Can I write Arrow tables to DuckDB?

Yes: `con.execute("INSERT INTO mytable SELECT * FROM arrow_table")`, where `arrow_table` is a PyArrow Table in scope.
### Is zero-copy truly zero bytes copied?
For read operations, yes — DuckDB reads the Arrow buffer in place. For write operations back to Arrow, DuckDB constructs a new Arrow buffer, which requires some allocation but avoids deserialization overhead.
### Does Arrow support all DuckDB data types?

Most types map cleanly. DuckDB's `HUGEINT` and custom `ENUM` types require conversion. Use `.df()` for full type compatibility.
Ready to try DenchClaw? Install in one command: `npx denchclaw`. Full setup guide →
