Spatial Analysis with DuckDB: Performance at Scale — Partitioning, BBox Prefilters, Lookup Tables, and Publishing


DuckDB Spatial is dangerously productive: it lets you run serious spatial analytics with nothing more than SQL and files. But when you move from “I can run this query” to “I can power a dashboard reliably,” the challenges shift.

At scale, your biggest enemies are:

  • scanning too much Parquet
  • doing spatial intersections repeatedly
  • shipping huge geometries when you only need numbers
  • unstable pipelines (quiet schema drift, CRS mistakes, unreproducible builds)

This post is the capstone: a practical playbook for making DuckDB Spatial fast, predictable, and publishable.



The performance mindset: make scans small, make geometry rare

In DuckDB, performance usually comes from reducing scan work and avoiding expensive geometry operations.

A useful hierarchy:

  1. Avoid reading files at all (partition pruning)
  2. If reading, avoid reading unnecessary columns (column pruning)
  3. If filtering spatially, prefilter with bbox columns first
  4. Avoid repeated ST_Intersects / ST_Intersection by precomputing relationships
  5. Only compute clipped geometry when you truly need geometry output
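
A quick way to check that the first three items are actually happening is to look at the query plan; a minimal sketch (path and column names are illustrative):

-- The Parquet scan in the plan lists only the projected columns
-- and the filters pushed down to the files, so you can confirm
-- that geometry is never read for a numeric summary like this.
EXPLAIN
SELECT date, AVG(spi_3) AS spi3_avg
FROM read_parquet('data/drought_metrics/*.parquet')
WHERE date >= DATE '2024-01-01'
GROUP BY date;

If the scan still projects geometry or reads unfiltered paths, the query (or the layout) is not pruning the way you expect.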

Parquet layout that scales: partitioning, file sizing, and column pruning

Partition by the filters you always use

For time-series geospatial datasets (climate, EO, sensors), the most common filter is time:

  • WHERE date BETWEEN ...
  • WHERE month = ...
  • “latest month”

So: partition by time first.

A typical cloud-friendly layout:

data/
  drought/
    year=2024/
      month=01/
        part-0000.parquet
        part-0001.parquet
      month=02/
        part-0000.parquet
    year=2025/
      month=01/
        part-0000.parquet

If your app frequently filters by region too, add region as an additional partition key, but only if its cardinality is reasonable:

data/
  drought/
    country=TR/
      year=2024/
        month=01/
          part-0000.parquet
    country=GB/
      year=2024/
        month=01/
          part-0000.parquet

Partitioning is not magic; it simply makes it possible to skip entire folders.
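
With a layout like the above, a hive-partitioned read lets DuckDB prune folders before touching any files; a minimal sketch (paths are illustrative):

-- Only the year=2024/month=01 folders are read; everything else is skipped.
SELECT cell_id, spi_3
FROM read_parquet('data/drought/*/*/*.parquet', hive_partitioning = true)
WHERE year = 2024
  AND month = 1;  -- depending on version/settings, partition values may arrive as strings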

Keep file sizes in a sane range

Too small: too many files, high overhead.
Too large: slow downloads, slow first query, poor parallelism.

A common rule of thumb for analytics Parquet is files of roughly 100 MB to 1 GB each (not tiny, not gigantic). The best size depends on your distribution method (local disk vs object storage) and whether you query in-browser.

For browser-based DuckDB WASM, you generally want smaller units so you can fetch just what you need for the UI state.

Embrace column pruning: don’t scan geometry if you don’t need it

This is the #1 simple win many teams miss.

If your query is a summary by admin region and time, you often only need:

  • date
  • cell_id (or h3)
  • value columns (spi_3, spei_3, etc.)
  • maybe bbox columns

Not geometry.

Design your Parquet schemas with that in mind:

  • “analytics tables” with no geometry
  • “geometry tables” for rendering or occasional spatial operations
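
A sketch of producing both kinds of output from source tables (table and file names are assumptions):

-- Analytics table: no geometry, cheap to scan for dashboards
COPY (
  SELECT cell_id, date, spi_3, spei_3
  FROM drought
) TO 'drought_metrics.parquet' (FORMAT PARQUET);

-- Geometry table: one row per cell, fetched only for rendering
-- (geometry is typically stored as WKB in the Parquet file)
COPY (
  SELECT cell_id, geom
  FROM cells
) TO 'cells_geom.parquet' (FORMAT PARQUET);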

BBox prefilters: index-like speed without indexes

Parquet files carry no spatial index, so when scanning them you need a cheap spatial prefilter step.

You can store bbox columns per feature:

  • xmin, ymin, xmax, ymax

Then prefilter candidate features with numeric comparisons (very fast), before running expensive ST_Intersects.

Create bbox columns at publish time

CREATE OR REPLACE TABLE published AS
SELECT
  * EXCLUDE (geom),
  geom,
  ST_XMin(geom) AS xmin,
  ST_YMin(geom) AS ymin,
  ST_XMax(geom) AS xmax,
  ST_YMax(geom) AS ymax
FROM projected;

Use bbox prefilter in queries

Example: filter features that could intersect a boundary bbox:

WITH b AS (
  SELECT
    geom,
    ST_XMin(geom) AS bxmin,
    ST_YMin(geom) AS bymin,
    ST_XMax(geom) AS bxmax,
    ST_YMax(geom) AS bymax
  FROM admin
  WHERE adm1_id = 34
  LIMIT 1
)
SELECT d.*
FROM drought d, b
WHERE
  d.xmax >= b.bxmin AND d.xmin <= b.bxmax
  AND d.ymax >= b.bymin AND d.ymin <= b.bymax
  AND ST_Intersects(d.geom, b.geom);

The numeric checks massively reduce the number of geometries that ever touch ST_Intersects.


The biggest dashboard speed win: precomputed lookup tables

Repeated spatial joins kill dashboards.

If every request does:

  • “select all cells”
  • “spatial join with admin polygons”
  • “aggregate”
  • “repeat for different dates”

…it will be slow and expensive.

The fix is to compute the expensive geometry relationship once, then reuse it forever:

  • grid cell → admin region mapping
  • optional weights (intersection area fraction) for area-weighted metrics

This turns “spatial join every request” into “simple join on IDs”.


Lookup table pattern A: cell → region membership (fastest)

This works well if your cells are fully inside one region (or if you’re okay assigning by centroid).

Example using centroids:

CREATE OR REPLACE TABLE cell_admin_lookup AS
SELECT
  c.cell_id,
  a.adm1_id,
  a.name
FROM cells c
JOIN admin a
  ON ST_Contains(a.geom, ST_Centroid(c.geom));

Now all “region summaries” become normal joins:

SELECT
  d.date,
  l.adm1_id,
  l.name,
  AVG(d.spi_3) AS spi3_avg
FROM drought_metrics d
JOIN cell_admin_lookup l
  ON d.cell_id = l.cell_id
GROUP BY d.date, l.adm1_id, l.name
ORDER BY d.date, l.adm1_id;

This is extremely fast compared to repeated ST_Intersects.

When centroid membership is acceptable
  • gridded products (H3, square grids) where small boundary slivers don’t matter
  • dashboards prioritizing speed over exact area weighting
  • exploratory analysis

Lookup table pattern B: cell → region with area weights (accurate and still fast)

If your cells straddle boundaries and accuracy matters, compute weights once.

CREATE OR REPLACE TABLE cell_admin_weights AS
SELECT
  c.cell_id,
  a.adm1_id,
  a.name,
  ST_Area(ST_Intersection(c.geom, a.geom)) AS inter_area,
  ST_Area(c.geom) AS cell_area,
  ST_Area(ST_Intersection(c.geom, a.geom)) / NULLIF(ST_Area(c.geom), 0) AS w_cell_fraction
FROM cells c
JOIN admin a
  ON ST_Intersects(c.geom, a.geom)
WHERE NOT ST_IsEmpty(ST_Intersection(c.geom, a.geom));

Now compute area-weighted summaries without any geometry work:

SELECT
  d.date,
  w.adm1_id,
  w.name,
  SUM(d.spi_3 * w.w_cell_fraction) / NULLIF(SUM(w.w_cell_fraction), 0) AS spi3_weighted
FROM drought_metrics d
JOIN cell_admin_weights w
  ON d.cell_id = w.cell_id
GROUP BY d.date, w.adm1_id, w.name
ORDER BY d.date, w.adm1_id;

You pay the spatial join once during a build step, then every dashboard query is pure numeric joins and aggregates.
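
Since the weights are a one-time build artifact, it's worth persisting them next to the metrics so every consumer reuses the same mapping; a minimal sketch:

-- Write the precomputed relationship once; dashboards only ever join on IDs
COPY cell_admin_weights TO 'cell_admin_weights.parquet' (FORMAT PARQUET);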


Separate “metrics” from “geometry” for scalable apps

A scalable data product often looks like two datasets:

1) Metrics table (no geometry)
  • cell_id
  • date
  • indicators (spi_3, spei_3, …)
2) Geometry table (rarely queried)
  • cell_id
  • geom
  • bbox columns

That allows:

  • time-series and dashboards to stay fast
  • map rendering to load geometry only when needed
  • DuckDB WASM to avoid downloading geometry for numeric widgets

Example “metrics-first” query for a dashboard card:

SELECT
  date,
  AVG(spi_3) AS global_spi3
FROM drought_metrics
WHERE date BETWEEN DATE '2024-01-01' AND DATE '2024-12-01'
GROUP BY date
ORDER BY date;

No geometry required.
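
When a map view does need shapes, pull geometry separately and only for the cells in play; a sketch assuming the cells_geom and lookup tables from earlier:

-- Geometry is fetched only for the selected region's cells
SELECT g.cell_id, g.geom
FROM cells_geom g
JOIN cell_admin_lookup l
  ON g.cell_id = l.cell_id
WHERE l.adm1_id = 34;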


Publishing strategy: versioned datasets, validation, and object storage

“Publishing” is where prototypes become products.

The goal is to make datasets:

  • versioned
  • reproducible
  • verifiable
  • easy to cache

Versioning: immutable releases

Use immutable output folders:

output/
  v2025-11-12/
    drought_metrics/
      year=2024/month=01/part-0000.parquet
    cell_admin_weights.parquet
    cells_geom.parquet
  v2026-01-27/
    ...

Benefits:

  • rollback is trivial
  • caches behave well (URLs don’t change)
  • you can compare two releases for regression checks
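
DuckDB can write this layout in one statement with a partitioned COPY; a sketch deriving the partition columns from date (the version string is illustrative):

COPY (
  SELECT *, year(date) AS year, month(date) AS month
  FROM drought_metrics
)
TO 'output/v2025-11-12/drought_metrics'
(FORMAT PARQUET, PARTITION_BY (year, month));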

A simple “latest” pointer can be handled via:

  • a small manifest file, or
  • a stable symlink in environments that support it, or
  • a redirect at the edge (if you use a CDN/worker)
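
For the manifest route, a tiny JSON file holding the current version string is enough, and DuckDB can read it directly; a sketch (file name and layout are assumptions):

-- output/manifest.json is assumed to contain: { "latest": "v2025-11-12" }
SELECT latest
FROM read_json_auto('output/manifest.json');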

Validation in CI: trust but verify

You don’t need a huge system. A few checks catch most disasters:

1) Schema checks
  • expected columns exist
  • types match (date is date, ids are consistent)
  • geometry column exists where expected
2) CRS sanity checks (when geometry exists)
  • SRID is consistent
  • bbox makes sense for the CRS
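
For the schema checks, DESCRIBE over the published files gives you column names and types as a plain result set to assert against; a minimal sketch:

-- Assert in CI that expected columns exist with the expected types
DESCRIBE SELECT *
FROM read_parquet('output/v2025-11-12/drought_metrics/*/*/*.parquet');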

Example bbox sanity:

SELECT
  MIN(xmin) AS xmin,
  MIN(ymin) AS ymin,
  MAX(xmax) AS xmax,
  MAX(ymax) AS ymax
FROM cells_geom;

3) Null and range checks

For drought indices:

  • count nulls
  • check min/max in plausible ranges
  • confirm expected date coverage

SELECT
  MIN(date) AS min_date,
  MAX(date) AS max_date,
  COUNT(*) AS n_rows,
  SUM(CASE WHEN spi_3 IS NULL THEN 1 ELSE 0 END) AS spi3_nulls,
  MIN(spi_3) AS spi3_min,
  MAX(spi_3) AS spi3_max
FROM drought_metrics;

4) Spot-check joins

Confirm that lookup tables cover most cells:

SELECT
  (SELECT COUNT(*) FROM drought_metrics) AS metrics_rows,
  (SELECT COUNT(*) FROM cell_admin_lookup) AS lookup_rows;

Or check unmapped cells:

SELECT COUNT(*) AS unmapped_cells
FROM (SELECT DISTINCT cell_id FROM drought_metrics) m
LEFT JOIN cell_admin_lookup l
  ON m.cell_id = l.cell_id
WHERE l.cell_id IS NULL;

Publishing to object storage (R2/S3) for downstream use

Once datasets are versioned and validated, publish to object storage:

  • Parquet partitions map naturally to “folders”
  • immutable version folders cache well
  • multiple consumers can read the same artifacts:
    • DuckDB local
    • DuckDB on servers
    • DuckDB WASM in the browser
    • Python/Polars/Arrow/Spark

The important shift:

  • object storage becomes your “data backend”
  • DuckDB becomes the universal query engine across environments

DuckDB WASM considerations (client-side querying)

If you plan to query in the browser:

  • keep geometry out of “default” queries
  • prefer metrics-first Parquet partitions
  • keep partitions small enough to fetch quickly
  • consider multi-resolution outputs (coarse for global views, fine for drilldowns)

A good pattern:

  • global widgets read small summary Parquet
  • map drilldowns read per-region/per-month partitions
  • geometry is fetched only for the current viewport and zoom level (or pre-tiled separately)
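
The same SQL works against the published object-storage URLs, whether from DuckDB with httpfs or from DuckDB WASM in the browser; a sketch with an illustrative URL:

-- Parquet over HTTP: only the needed columns and row groups are fetched via range requests
SELECT date, AVG(spi_3) AS spi3_avg
FROM read_parquet('https://data.example.com/v2025-11-12/drought_metrics/year=2024/month=01/part-0000.parquet')
GROUP BY date
ORDER BY date;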

A scalable blueprint (putting it all together)

A production-friendly architecture usually ends up with:

  1. Build step (DuckDB SQL):
    • ingest raw sources
    • clean + validate
    • produce metrics Parquet partitions
    • produce geometry Parquet (optional)
    • produce lookup tables (membership or weights)
    • write versioned outputs
  2. CI validation:
    • schema + CRS sanity + range checks
  3. Publish:
    • upload version folder to R2/S3
  4. Consumption:
    • dashboards: metrics + lookup tables (no geometry)
    • maps: geometry + filtered metrics
    • browser: DuckDB WASM reads Parquet directly

This is the core “local-first to cloud-native” story.
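
As a rough end-to-end sketch of the build step as one SQL script (all table names, paths, and the version string are assumptions):

INSTALL spatial;
LOAD spatial;

-- 1) Metrics partitions (no geometry)
COPY (
  SELECT cell_id, date, spi_3, spei_3,
         year(date) AS year, month(date) AS month
  FROM drought
)
TO 'output/v2025-11-12/drought_metrics'
(FORMAT PARQUET, PARTITION_BY (year, month));

-- 2) Geometry table, fetched only when rendering
COPY (SELECT cell_id, geom FROM cells)
TO 'output/v2025-11-12/cells_geom.parquet' (FORMAT PARQUET);

-- 3) Lookup table: pay the spatial join once, here
COPY (
  SELECT c.cell_id, a.adm1_id, a.name
  FROM cells c
  JOIN admin a ON ST_Contains(a.geom, ST_Centroid(c.geom))
)
TO 'output/v2025-11-12/cell_admin_lookup.parquet' (FORMAT PARQUET);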


Final checklist: make it fast, make it boring

If you want DuckDB spatial workloads to be fast and reliable:

  • Partition Parquet by your most common filters (usually time)
  • Keep files “not too small, not too large”
  • Split metrics from geometry whenever possible
  • Add bbox columns and use bbox prefilters
  • Precompute lookup tables to remove repeated spatial joins
  • Publish versioned outputs
  • Validate outputs automatically before release
  • Treat your Parquet/GeoParquet as the product