SQL Patterns for Geospatial Analytics in DuckDB: Clip, Intersect, Group, Quantiles, Time Windows


SQL Patterns for Geospatial Analytics in DuckDB: Clip, Intersect, Group, Quantiles, Time Windows

DuckDB + duckdb-spatial is a surprisingly capable analytics stack for geospatial workloads—especially when your data lives in Parquet/GeoParquet and your goal is fast, reproducible SQL rather than a full GIS server.

This post is a cookbook of SQL patterns I use constantly in analytical pipelines and map-driven apps:

  • Clip features to a boundary polygon (fast pre-filter + exact clip)
  • Intersections and area-weighted summaries
  • Grouping by region and by grid index (H3-like patterns)
  • Quantiles for classification (choropleths, outlier detection)
  • Time windows (rolling stats, monthly/seasonal aggregates)

Everything below is written to avoid Markdown-breaking nested fences: code blocks are indented, not triple-backticked.


Setup: load spatial and define your sources

Most workflows start by loading the spatial extension and reading Parquet/GeoParquet into views or tables.

INSTALL spatial;
LOAD spatial;

-- Example sources (replace with your paths)
CREATE OR REPLACE VIEW drought AS
SELECT * FROM read_parquet('data/drought/*.parquet');

CREATE OR REPLACE VIEW admin AS
SELECT * FROM read_parquet('data/admin/admin_adm1.parquet');

Typical columns you’ll see in practice:

  • drought: date, geom (cell/feature geometry), metrics like spi_3, spei_3, etc.
  • admin: adm1_id, name, geom (boundary polygon)

Pattern 1: Clip to a boundary (prefilter + exact clip)

Clipping is the foundation of “analytics within an area”: “SPI over Turkey”, “NDVI in a catchment”, etc.

Step A: Fast prefilter with bbox intersection

Most engines become fast when you reduce candidate rows early. Use bbox tests (or geometry intersects) to prefilter:

-- Pick one boundary
WITH boundary AS (
  SELECT geom
  FROM admin
  WHERE adm1_id = 34
  LIMIT 1
)
SELECT d.*
FROM drought d, boundary b
WHERE ST_Intersects(d.geom, b.geom);

This already cuts down data massively, but note: it returns features that touch the boundary, not the clipped geometry.

Step B: Exact clip geometry with ST_Intersection
WITH boundary AS (
  SELECT geom
  FROM admin
  WHERE adm1_id = 34
  LIMIT 1
)
SELECT
  d.* EXCLUDE (geom),
  ST_Intersection(d.geom, b.geom) AS geom
FROM drought d, boundary b
WHERE ST_Intersects(d.geom, b.geom);

Add a filter to remove empty results:

WITH boundary AS (
  SELECT geom
  FROM admin
  WHERE adm1_id = 34
  LIMIT 1
)
SELECT
  d.* EXCLUDE (geom),
  ST_Intersection(d.geom, b.geom) AS geom
FROM drought d, boundary b
WHERE ST_Intersects(d.geom, b.geom)
  AND NOT ST_IsEmpty(ST_Intersection(d.geom, b.geom));
Practical note

ST_Intersection is expensive. Prefer:

  1. prefilter with ST_Intersects (or bbox columns if you have them)
  2. only then run ST_Intersection

Pattern 2: Area-weighted summaries after intersection

Area-weighted stats are essential when your features are polygons (grid cells, land parcels) and you intersect them with boundaries.

Example: “mean SPI_3 for an admin region, weighted by intersected area”.

WITH boundary AS (
  SELECT adm1_id, name, geom
  FROM admin
  WHERE adm1_id = 34
  LIMIT 1
),
inter AS (
  SELECT
    d.date,
    b.adm1_id,
    b.name,
    ST_Area(ST_Intersection(d.geom, b.geom)) AS inter_area,
    d.spi_3
  FROM drought d, boundary b
  WHERE ST_Intersects(d.geom, b.geom)
)
SELECT
  date,
  adm1_id,
  name,
  SUM(spi_3 * inter_area) / NULLIF(SUM(inter_area), 0) AS spi3_area_weighted
FROM inter
GROUP BY date, adm1_id, name
ORDER BY date;
Why area weighting matters

If you simply do AVG(spi_3) over intersecting cells, a tiny sliver counts the same as a cell mostly inside the region. Area-weighting fixes that.


Pattern 3: “Share of region” metrics (percent coverage)

Common dashboard question: “What percent of this region is under severe drought?”

Define a threshold and compute area share.

WITH boundary AS (
  SELECT adm1_id, name, geom
  FROM admin
  WHERE adm1_id = 34
  LIMIT 1
),
inter AS (
  SELECT
    d.date,
    b.adm1_id,
    b.name,
    ST_Area(ST_Intersection(d.geom, b.geom)) AS inter_area,
    d.spi_3
  FROM drought d, boundary b
  WHERE ST_Intersects(d.geom, b.geom)
),
totals AS (
  SELECT
    date,
    adm1_id,
    name,
    SUM(inter_area) AS total_area
  FROM inter
  GROUP BY date, adm1_id, name
),
severe AS (
  SELECT
    date,
    adm1_id,
    name,
    SUM(inter_area) AS severe_area
  FROM inter
  WHERE spi_3 <= -1.5
  GROUP BY date, adm1_id, name
)
SELECT
  t.date,
  t.adm1_id,
  t.name,
  (COALESCE(s.severe_area, 0) / NULLIF(t.total_area, 0)) * 100.0 AS severe_pct_area
FROM totals t
LEFT JOIN severe s
  ON t.date = s.date AND t.adm1_id = s.adm1_id
ORDER BY t.date;

Pattern 4: Group by region for multi-region summaries

Instead of one boundary, run the same logic over all admin polygons (or a subset).

WITH inter AS (
  SELECT
    d.date,
    a.adm1_id,
    a.name,
    ST_Area(ST_Intersection(d.geom, a.geom)) AS inter_area,
    d.spi_3
  FROM drought d
  JOIN admin a
    ON ST_Intersects(d.geom, a.geom)
)
SELECT
  date,
  adm1_id,
  name,
  SUM(spi_3 * inter_area) / NULLIF(SUM(inter_area), 0) AS spi3_area_weighted
FROM inter
GROUP BY date, adm1_id, name
ORDER BY date, adm1_id;
Performance hint

This can be heavy if both datasets are large polygons.

Typical ways to speed it up:

  • prefilter using bbox columns (xmin/ymin/xmax/ymax) if available
  • reduce admin set (only one country, only a few regions)
  • compute and store a lookup table mapping grid cells to admin IDs (best for repeated dashboards)

Pattern 5: “Clip a boundary to features” (reverse intersection)

Sometimes you want the boundary clipped by features for rendering, or to produce “inside-only” geometries.

Example: building a dataset “drought cells clipped to each adm1”:

CREATE OR REPLACE TABLE drought_by_adm1 AS
SELECT
  d.date,
  a.adm1_id,
  a.name,
  d.spi_3,
  ST_Intersection(d.geom, a.geom) AS geom
FROM drought d
JOIN admin a
  ON ST_Intersects(d.geom, a.geom)
WHERE NOT ST_IsEmpty(ST_Intersection(d.geom, a.geom));

This is useful if you publish pre-clipped outputs for web maps.


Pattern 6: Quantiles (for choropleths and classification)

Quantiles are one of the most practical analytics tools for cartography:

  • 5-quantile = quintiles
  • 10-quantile = deciles
  • compare distribution changes month-to-month

Example: compute quantiles for SPI_3 on a single date.

WITH d AS (
  SELECT spi_3
  FROM drought
  WHERE date = DATE '2024-07-01'
    AND spi_3 IS NOT NULL
)
SELECT
  quantile_cont(spi_3, 0.10) AS q10,
  quantile_cont(spi_3, 0.25) AS q25,
  quantile_cont(spi_3, 0.50) AS q50,
  quantile_cont(spi_3, 0.75) AS q75,
  quantile_cont(spi_3, 0.90) AS q90
FROM d;
Pattern: assign quantile bins for mapping

Compute thresholds once, then label each feature.

WITH thresholds AS (
  SELECT
    quantile_cont(spi_3, 0.20) AS q20,
    quantile_cont(spi_3, 0.40) AS q40,
    quantile_cont(spi_3, 0.60) AS q60,
    quantile_cont(spi_3, 0.80) AS q80
  FROM drought
  WHERE date = DATE '2024-07-01'
    AND spi_3 IS NOT NULL
)
SELECT
  d.*,
  CASE
    WHEN d.spi_3 <= t.q20 THEN 1
    WHEN d.spi_3 <= t.q40 THEN 2
    WHEN d.spi_3 <= t.q60 THEN 3
    WHEN d.spi_3 <= t.q80 THEN 4
    ELSE 5
  END AS quintile
FROM drought d, thresholds t
WHERE d.date = DATE '2024-07-01';

Pattern 7: Time windows (rolling averages, seasonal context)

Time windows are essential for climate analytics and dashboards.

Assume:

  • date is a DATE
  • you have one row per geometry per date (or per month)
Rolling mean per feature (e.g., 3-month rolling average)
SELECT
  h3_index,
  date,
  AVG(spi_3) OVER (
    PARTITION BY h3_index
    ORDER BY date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS spi3_roll3
FROM drought
WHERE spi_3 IS NOT NULL;
Rolling min/max (extremes)
SELECT
  h3_index,
  date,
  MIN(spi_3) OVER (
    PARTITION BY h3_index
    ORDER BY date
    ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
  ) AS spi3_min12,
  MAX(spi_3) OVER (
    PARTITION BY h3_index
    ORDER BY date
    ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
  ) AS spi3_max12
FROM drought
WHERE spi_3 IS NOT NULL;
Monthly aggregation (if your data is daily)
SELECT
  DATE_TRUNC('month', date) AS month,
  h3_index,
  AVG(spi_3) AS spi3_month_avg
FROM drought
GROUP BY month, h3_index
ORDER BY month, h3_index;

Pattern 8: “Latest available date” and trailing windows

In operational apps, you often want “latest month” without hardcoding it.

WITH latest AS (
  SELECT MAX(date) AS max_date
  FROM drought
)
SELECT *
FROM drought
WHERE date = (SELECT max_date FROM latest);

Trailing 12 months from latest:

WITH latest AS (
  SELECT MAX(date) AS max_date
  FROM drought
)
SELECT *
FROM drought
WHERE date > (SELECT max_date - INTERVAL 12 MONTH FROM latest)
  AND date <= (SELECT max_date FROM latest);

Pattern 9: Combine clip + time windows (region time series)

This is a common dashboard endpoint: “SPI time series for an admin region”.

WITH boundary AS (
  SELECT adm1_id, name, geom
  FROM admin
  WHERE adm1_id = 34
  LIMIT 1
),
inter AS (
  SELECT
    d.date,
    ST_Area(ST_Intersection(d.geom, b.geom)) AS inter_area,
    d.spi_3
  FROM drought d, boundary b
  WHERE ST_Intersects(d.geom, b.geom)
    AND d.spi_3 IS NOT NULL
),
region_daily AS (
  SELECT
    date,
    SUM(spi_3 * inter_area) / NULLIF(SUM(inter_area), 0) AS spi3_region
  FROM inter
  GROUP BY date
)
SELECT
  date,
  spi3_region,
  AVG(spi3_region) OVER (
    ORDER BY date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS spi3_region_roll3
FROM region_daily
ORDER BY date;

Performance tips that actually move the needle

  1. Precompute boundary geometry once (CTE is fine; materialized table/view is better if reused).
  2. Use bbox columns (xmin/ymin/xmax/ymax) when possible to avoid expensive geometry operations early.
  3. Avoid running ST_Intersection unless you truly need clipped geometry or area-weighted stats.
  4. For repeated dashboards: build a lookup table between grid cells and admin IDs (cell → region mapping).
  5. Partition Parquet by time (year/month) so WHERE date BETWEEN ... reads less data.