Spatial Analysis with DuckDB: Performance at Scale — Partitioning, BBox Prefilters, Lookup Tables, and Publishing
DuckDB Spatial is dangerously productive: it lets you run serious spatial analytics with nothing more than SQL and files. But when you move from “I can run this query” to “I can power a dashboard reliably,” the challenges shift.
At scale, your biggest enemies are:
- scanning too much Parquet
- doing spatial intersections repeatedly
- shipping huge geometries when you only need numbers
- unstable pipelines (quiet schema drift, CRS mistakes, unreproducible builds)
This post is the capstone: a practical playbook for making DuckDB Spatial fast, predictable, and publishable.
The performance mindset: make scans small, make geometry rare
In DuckDB, performance usually comes from reducing scan work and avoiding expensive geometry operations.
A useful hierarchy:
- Avoid reading files at all (partition pruning)
- If reading, avoid reading unnecessary columns (column pruning)
- If filtering spatially, prefilter with bbox columns first
- Avoid repeated ST_Intersects/ST_Intersection by precomputing relationships
- Only compute clipped geometry when you truly need geometry output
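One practical habit: before optimizing anything, check the plan. A minimal sketch (the path, hive layout, and column names below are placeholders for your own data) using EXPLAIN to see whether partition and column pruning actually kick in:

    -- Placeholder layout: data/drought/year=YYYY/month=MM/part-*.parquet
    EXPLAIN
    SELECT date, AVG(spi_3) AS spi3_avg
    FROM read_parquet('data/drought/**/*.parquet', hive_partitioning = true)
    WHERE year = 2024 AND month = 1
    GROUP BY date;

If the scan node shows only the columns you selected and the year/month filters pushed into the Parquet read, the layout is doing its job.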
Parquet layout that scales: partitioning, file sizing, and column pruning
Partition by the filters you always use
For time-series geospatial datasets (climate, EO, sensors), the most common filter is time:
- WHERE date BETWEEN ...
- WHERE month = ...
- "latest month"
So: partition by time first.
A typical cloud-friendly layout:
data/
drought/
year=2024/
month=01/
part-0000.parquet
part-0001.parquet
month=02/
part-0000.parquet
year=2025/
month=01/
part-0000.parquet
If your app frequently filters by region too, add region as another partition level, but only if its cardinality is reasonable:
data/
drought/
country=TR/
year=2024/
month=01/
part-0000.parquet
country=GB/
year=2024/
month=01/
part-0000.parquet
Partitioning is not magic; it simply lets the engine skip entire folders when your query filters on the partition columns.
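DuckDB can write this layout itself. A minimal sketch, assuming a drought table with a date column (table names and output paths are placeholders):

    -- Write one folder per year/month so readers can skip whole folders
    COPY (
        SELECT *, YEAR(date) AS year, MONTH(date) AS month
        FROM drought
    ) TO 'data/drought' (FORMAT PARQUET, PARTITION_BY (year, month));

    -- Read back with hive partitioning so year/month filters prune folders
    SELECT COUNT(*) AS n
    FROM read_parquet('data/drought/**/*.parquet', hive_partitioning = true)
    WHERE year = 2024 AND month = 1;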
Keep file sizes in a sane range
Too small: too many files, high overhead.
Too large: slow downloads, slow first query, poor parallelism.
A practical rule of thumb for analytics Parquet is a moderate file size, often somewhere in the tens to hundreds of megabytes: large enough to amortize per-file overhead, small enough to download and parallelize well. The best size depends on your distribution method (local disk vs object storage) and whether you query in-browser.
For browser-based DuckDB WASM, you generally want smaller units so you can fetch just what you need for the UI state.
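At write time, DuckDB exposes a few knobs. A small sketch using ROW_GROUP_SIZE (the value here is illustrative, not a recommendation) to control how much data lands in each Parquet row group, which in turn affects how granular skipping and fetching can be:

    -- Smaller row groups: finer-grained skipping/fetching, slightly more metadata overhead
    COPY drought_metrics
    TO 'data/drought_metrics.parquet'
    (FORMAT PARQUET, ROW_GROUP_SIZE 100000);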
Embrace column pruning: don’t scan geometry if you don’t need it
This is the #1 simple win many teams miss.
If your query is a summary by admin region and time, you often only need:
- date
- cell_id (or h3)
- value columns (spi_3, spei_3, etc.)
- maybe bbox columns
Not geometry.
Design your Parquet schemas with that in mind:
- “analytics tables” with no geometry
- “geometry tables” for rendering or occasional spatial operations
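For example, when a dashboard card needs only dates and one indicator, select exactly those columns; DuckDB will read just those column chunks from Parquet and never touch geometry (paths are placeholders):

    -- Only date, cell_id and spi_3 are read from the files
    SELECT date, cell_id, spi_3
    FROM read_parquet('data/drought/**/*.parquet', hive_partitioning = true)
    WHERE year = 2024;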
BBox prefilters: index-like speed without indexes
When you query Parquet directly there is no spatial index to lean on, so you need a cheap spatial filter step.
You can store bbox columns per feature:
xmin, ymin, xmax, ymax
Then prefilter candidate features with numeric comparisons (very fast), before running expensive ST_Intersects.
Create bbox columns at publish time
CREATE OR REPLACE TABLE published AS
SELECT
* EXCLUDE (geom),
geom,
ST_XMin(geom) AS xmin,
ST_YMin(geom) AS ymin,
ST_XMax(geom) AS xmax,
ST_YMax(geom) AS ymax
FROM projected;
Use bbox prefilter in queries
Example: filter features that could intersect a boundary bbox:
WITH b AS (
SELECT
geom,
ST_XMin(geom) AS bxmin,
ST_YMin(geom) AS bymin,
ST_XMax(geom) AS bxmax,
ST_YMax(geom) AS bymax
FROM admin
WHERE adm1_id = 34
LIMIT 1
)
SELECT d.*
FROM drought d, b
WHERE
d.xmax >= b.bxmin AND d.xmin <= b.bxmax
AND d.ymax >= b.bymin AND d.ymin <= b.bymax
AND ST_Intersects(d.geom, b.geom);
The numeric checks massively reduce the number of geometries that ever touch ST_Intersects.
The biggest dashboard speed win: precomputed lookup tables
Repeated spatial joins kill dashboards.
If every request does:
- “select all cells”
- “spatial join with admin polygons”
- “aggregate”
- “repeat for different dates”
…it will be slow and expensive.
The fix is to compute the expensive geometry relationship once, then reuse it forever:
- grid cell → admin region mapping
- optional weights (intersection area fraction) for area-weighted metrics
This turns “spatial join every request” into “simple join on IDs”.
Lookup table pattern A: cell → region membership (fastest)
This works well if your cells are fully inside one region (or if you’re okay assigning by centroid).
Example using centroids:
CREATE OR REPLACE TABLE cell_admin_lookup AS
SELECT
c.cell_id,
a.adm1_id,
a.name
FROM cells c
JOIN admin a
ON ST_Contains(a.geom, ST_Centroid(c.geom));
Now all “region summaries” become normal joins:
SELECT
d.date,
l.adm1_id,
l.name,
AVG(d.spi_3) AS spi3_avg
FROM drought_metrics d
JOIN cell_admin_lookup l
ON d.cell_id = l.cell_id
GROUP BY d.date, l.adm1_id, l.name
ORDER BY d.date, l.adm1_id;
This is extremely fast compared to repeated ST_Intersects.
When centroid membership is acceptable
- gridded products (H3, square grids) where small boundary slivers don’t matter
- dashboards prioritizing speed over exact area weighting
- exploratory analysis
Lookup table pattern B: cell → region with area weights (accurate and still fast)
If your cells straddle boundaries and accuracy matters, compute weights once.
    CREATE OR REPLACE TABLE cell_admin_weights AS
    WITH pairs AS (
        SELECT
            c.cell_id,
            a.adm1_id,
            a.name,
            ST_Intersection(c.geom, a.geom) AS inter_geom,
            ST_Area(c.geom) AS cell_area
        FROM cells c
        JOIN admin a
          ON ST_Intersects(c.geom, a.geom)
    )
    SELECT
        cell_id,
        adm1_id,
        name,
        ST_Area(inter_geom) AS inter_area,
        cell_area,
        ST_Area(inter_geom) / NULLIF(cell_area, 0) AS w_cell_fraction
    FROM pairs
    WHERE NOT ST_IsEmpty(inter_geom);
Now compute area-weighted summaries without any geometry work:
SELECT
d.date,
w.adm1_id,
w.name,
SUM(d.spi_3 * w.w_cell_fraction) / NULLIF(SUM(w.w_cell_fraction), 0) AS spi3_weighted
FROM drought_metrics d
JOIN cell_admin_weights w
ON d.cell_id = w.cell_id
GROUP BY d.date, w.adm1_id, w.name
ORDER BY d.date, w.adm1_id;
You pay the spatial join once during a build step, then every dashboard query is pure numeric joins and aggregates.
Separate “metrics” from “geometry” for scalable apps
A scalable data product often looks like two datasets:
1) Metrics table (no geometry)
- cell_id
- date
- indicators (spi_3, spei_3, …)
2) Geometry table (rarely queried)
- cell_id
- geom
- bbox columns
That allows:
- time-series and dashboards to stay fast
- map rendering to load geometry only when needed
- DuckDB WASM to avoid downloading geometry for numeric widgets
Example “metrics-first” query for a dashboard card:
SELECT
date,
AVG(spi_3) AS global_spi3
FROM drought_metrics
WHERE date BETWEEN DATE '2024-01-01' AND DATE '2024-12-01'
GROUP BY date
ORDER BY date;
No geometry required.
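When the map does need geometry, fetch it separately and only for what is on screen. A sketch, assuming the geometry table carries the bbox columns from earlier and using a hypothetical viewport:

    -- Hypothetical viewport: lon -10..40, lat 35..60
    SELECT g.cell_id, ST_AsGeoJSON(g.geom) AS geojson
    FROM cells_geom g
    WHERE g.xmax >= -10 AND g.xmin <= 40
      AND g.ymax >= 35 AND g.ymin <= 60;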
Publishing strategy: versioned datasets, validation, and object storage
“Publishing” is where prototypes become products.
The goal is to make datasets:
- versioned
- reproducible
- verifiable
- easy to cache
Versioning: immutable releases
Use immutable output folders:
output/
v2025-11-12/
drought_metrics/
year=2024/month=01/part-0000.parquet
cell_admin_weights.parquet
cells_geom.parquet
v2026-01-27/
...
Benefits:
- rollback is trivial
- caches behave well (URLs don’t change)
- you can compare two releases for regression checks
A simple “latest” pointer can be handled via:
- a small manifest file, or
- a stable symlink in environments that support it, or
- a redirect at the edge (if you use a CDN/worker)
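For example, with a tiny manifest (a hypothetical output/manifest.json containing {"latest": "v2025-11-12"}), any DuckDB consumer can resolve the current release before building its read paths:

    -- Resolve the latest release from a hypothetical manifest file
    SELECT latest
    FROM read_json_auto('output/manifest.json');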
Validation in CI: trust but verify
You don’t need a huge system. A few checks catch most disasters:
1) Schema checks
- expected columns exist
- types match (date is date, ids are consistent)
- geometry column exists where expected
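A lightweight way to do this is DESCRIBE over the published files themselves (the path is a placeholder for your versioned layout):

    -- Lists column names and types as DuckDB sees them in the release
    DESCRIBE
    SELECT *
    FROM read_parquet('output/v2025-11-12/drought_metrics/**/*.parquet',
                      hive_partitioning = true);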
2) CRS sanity checks (when geometry exists)
- SRID is consistent
- bbox makes sense for the CRS
Example bbox sanity check (for EPSG:4326 data, the results should fall roughly within -180..180 longitude and -90..90 latitude):
SELECT
MIN(xmin) AS xmin,
MIN(ymin) AS ymin,
MAX(xmax) AS xmax,
MAX(ymax) AS ymax
FROM cells_geom;
3) Null and range checks
For drought indices:
- count nulls
- check min/max in plausible ranges
- confirm expected date coverage
    SELECT
        MIN(date) AS min_date,
        MAX(date) AS max_date,
        COUNT(*) AS n_rows,
        SUM(CASE WHEN spi_3 IS NULL THEN 1 ELSE 0 END) AS spi3_nulls,
        MIN(spi_3) AS spi3_min,
        MAX(spi_3) AS spi3_max
    FROM drought_metrics;
4) Spot-check joins
First, compare overall table sizes as a rough sanity check:
SELECT
(SELECT COUNT(*) FROM drought_metrics) AS metrics_rows,
(SELECT COUNT(*) FROM cell_admin_lookup) AS lookup_rows;
Then check for unmapped cells directly:
SELECT COUNT(*) AS unmapped_cells
FROM (SELECT DISTINCT cell_id FROM drought_metrics) m
LEFT JOIN cell_admin_lookup l
ON m.cell_id = l.cell_id
WHERE l.cell_id IS NULL;
Publishing to object storage (R2/S3) for downstream use
Once datasets are versioned and validated, publish to object storage:
- Parquet partitions map naturally to “folders”
- immutable version folders cache well
- multiple consumers can read the same artifacts:
- DuckDB local
- DuckDB on servers
- DuckDB WASM in the browser
- Python/Polars/Arrow/Spark
The important shift:
- object storage becomes your “data backend”
- DuckDB becomes the universal query engine across environments
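Reading those artifacts is then the same SQL with a different path. A minimal sketch with the httpfs extension (the bucket name is a placeholder; for R2, point DuckDB's S3 endpoint at your account and supply credentials first):

    INSTALL httpfs;
    LOAD httpfs;

    -- Hypothetical bucket and release path; configure S3/R2 credentials before querying
    SELECT date, cell_id, spi_3
    FROM read_parquet('s3://my-bucket/output/v2025-11-12/drought_metrics/year=2024/month=01/*.parquet');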
DuckDB WASM considerations (client-side querying)
If you plan to query in the browser:
- keep geometry out of “default” queries
- prefer metrics-first Parquet partitions
- keep partitions small enough to fetch quickly
- consider multi-resolution outputs (coarse for global views, fine for drilldowns)
A good pattern:
- global widgets read small summary Parquet
- map drilldowns read per-region/per-month partitions
- geometry is fetched only for the current viewport and zoom level (or pre-tiled separately)
A scalable blueprint (putting it all together)
A production-friendly architecture usually ends up with:
- Build step (DuckDB SQL):
- ingest raw sources
- clean + validate
- produce metrics Parquet partitions
- produce geometry Parquet (optional)
- produce lookup tables (membership or weights)
- write versioned outputs
- CI validation:
- schema + CRS sanity + range checks
- Publish:
- upload version folder to R2/S3
- Consumption:
- dashboards: metrics + lookup tables (no geometry)
- maps: geometry + filtered metrics
- browser: DuckDB WASM reads Parquet directly
This is the core “local-first to cloud-native” story.
Final checklist: make it fast, make it boring
If you want DuckDB spatial workloads to be fast and reliable:
- Partition Parquet by your most common filters (usually time)
- Keep files “not too small, not too large”
- Split metrics from geometry whenever possible
- Add bbox columns and use bbox prefilters
- Precompute lookup tables to remove repeated spatial joins
- Publish versioned outputs
- Validate outputs automatically before release
- Treat your Parquet/GeoParquet as the product