SQL Patterns for Geospatial Analytics in DuckDB: Clip, Intersect, Group, Quantiles, Time Windows
SQL Patterns for Geospatial Analytics in DuckDB: Clip, Intersect, Group, Quantiles, Time Windows
DuckDB + duckdb-spatial is a surprisingly capable analytics stack for geospatial workloads—especially when your data lives in Parquet/GeoParquet and your goal is fast, reproducible SQL rather than a full GIS server.
This post is a cookbook of SQL patterns I use constantly in analytical pipelines and map-driven apps:
- Clip features to a boundary polygon (fast pre-filter + exact clip)
- Intersections and area-weighted summaries
- Grouping by region and by grid index (H3-like patterns)
- Quantiles for classification (choropleths, outlier detection)
- Time windows (rolling stats, monthly/seasonal aggregates)
Everything below is written to avoid Markdown-breaking nested fences: code blocks are indented, not triple-backticked.
Setup: load spatial and define your sources
Most workflows start by loading the spatial extension and reading Parquet/GeoParquet into views or tables.
INSTALL spatial;
LOAD spatial;
-- Example sources (replace with your paths)
CREATE OR REPLACE VIEW drought AS
SELECT * FROM read_parquet('data/drought/*.parquet');
CREATE OR REPLACE VIEW admin AS
SELECT * FROM read_parquet('data/admin/admin_adm1.parquet');
Typical columns you’ll see in practice:
drought:date,geom(cell/feature geometry), metrics likespi_3,spei_3, etc.admin:adm1_id,name,geom(boundary polygon)
Pattern 1: Clip to a boundary (prefilter + exact clip)
Clipping is the foundation of “analytics within an area”: “SPI over Turkey”, “NDVI in a catchment”, etc.
Step A: Fast prefilter with bbox intersection
Most engines become fast when you reduce candidate rows early. Use bbox tests (or geometry intersects) to prefilter:
-- Pick one boundary
WITH boundary AS (
SELECT geom
FROM admin
WHERE adm1_id = 34
LIMIT 1
)
SELECT d.*
FROM drought d, boundary b
WHERE ST_Intersects(d.geom, b.geom);
This already cuts down data massively, but note: it returns features that touch the boundary, not the clipped geometry.
Step B: Exact clip geometry with ST_Intersection
WITH boundary AS (
SELECT geom
FROM admin
WHERE adm1_id = 34
LIMIT 1
)
SELECT
d.* EXCLUDE (geom),
ST_Intersection(d.geom, b.geom) AS geom
FROM drought d, boundary b
WHERE ST_Intersects(d.geom, b.geom);
Add a filter to remove empty results:
WITH boundary AS (
SELECT geom
FROM admin
WHERE adm1_id = 34
LIMIT 1
)
SELECT
d.* EXCLUDE (geom),
ST_Intersection(d.geom, b.geom) AS geom
FROM drought d, boundary b
WHERE ST_Intersects(d.geom, b.geom)
AND NOT ST_IsEmpty(ST_Intersection(d.geom, b.geom));
Practical note
ST_Intersection is expensive. Prefer:
- prefilter with
ST_Intersects(or bbox columns if you have them) - only then run
ST_Intersection
Pattern 2: Area-weighted summaries after intersection
Area-weighted stats are essential when your features are polygons (grid cells, land parcels) and you intersect them with boundaries.
Example: “mean SPI_3 for an admin region, weighted by intersected area”.
WITH boundary AS (
SELECT adm1_id, name, geom
FROM admin
WHERE adm1_id = 34
LIMIT 1
),
inter AS (
SELECT
d.date,
b.adm1_id,
b.name,
ST_Area(ST_Intersection(d.geom, b.geom)) AS inter_area,
d.spi_3
FROM drought d, boundary b
WHERE ST_Intersects(d.geom, b.geom)
)
SELECT
date,
adm1_id,
name,
SUM(spi_3 * inter_area) / NULLIF(SUM(inter_area), 0) AS spi3_area_weighted
FROM inter
GROUP BY date, adm1_id, name
ORDER BY date;
Why area weighting matters
If you simply do AVG(spi_3) over intersecting cells, a tiny sliver counts the same as a cell mostly inside the region. Area-weighting fixes that.
Pattern 3: “Share of region” metrics (percent coverage)
Common dashboard question: “What percent of this region is under severe drought?”
Define a threshold and compute area share.
WITH boundary AS (
SELECT adm1_id, name, geom
FROM admin
WHERE adm1_id = 34
LIMIT 1
),
inter AS (
SELECT
d.date,
b.adm1_id,
b.name,
ST_Area(ST_Intersection(d.geom, b.geom)) AS inter_area,
d.spi_3
FROM drought d, boundary b
WHERE ST_Intersects(d.geom, b.geom)
),
totals AS (
SELECT
date,
adm1_id,
name,
SUM(inter_area) AS total_area
FROM inter
GROUP BY date, adm1_id, name
),
severe AS (
SELECT
date,
adm1_id,
name,
SUM(inter_area) AS severe_area
FROM inter
WHERE spi_3 <= -1.5
GROUP BY date, adm1_id, name
)
SELECT
t.date,
t.adm1_id,
t.name,
(COALESCE(s.severe_area, 0) / NULLIF(t.total_area, 0)) * 100.0 AS severe_pct_area
FROM totals t
LEFT JOIN severe s
ON t.date = s.date AND t.adm1_id = s.adm1_id
ORDER BY t.date;
Pattern 4: Group by region for multi-region summaries
Instead of one boundary, run the same logic over all admin polygons (or a subset).
WITH inter AS (
SELECT
d.date,
a.adm1_id,
a.name,
ST_Area(ST_Intersection(d.geom, a.geom)) AS inter_area,
d.spi_3
FROM drought d
JOIN admin a
ON ST_Intersects(d.geom, a.geom)
)
SELECT
date,
adm1_id,
name,
SUM(spi_3 * inter_area) / NULLIF(SUM(inter_area), 0) AS spi3_area_weighted
FROM inter
GROUP BY date, adm1_id, name
ORDER BY date, adm1_id;
Performance hint
This can be heavy if both datasets are large polygons.
Typical ways to speed it up:
- prefilter using bbox columns (
xmin/ymin/xmax/ymax) if available - reduce admin set (only one country, only a few regions)
- compute and store a lookup table mapping grid cells to admin IDs (best for repeated dashboards)
Pattern 5: “Clip a boundary to features” (reverse intersection)
Sometimes you want the boundary clipped by features for rendering, or to produce “inside-only” geometries.
Example: building a dataset “drought cells clipped to each adm1”:
CREATE OR REPLACE TABLE drought_by_adm1 AS
SELECT
d.date,
a.adm1_id,
a.name,
d.spi_3,
ST_Intersection(d.geom, a.geom) AS geom
FROM drought d
JOIN admin a
ON ST_Intersects(d.geom, a.geom)
WHERE NOT ST_IsEmpty(ST_Intersection(d.geom, a.geom));
This is useful if you publish pre-clipped outputs for web maps.
Pattern 6: Quantiles (for choropleths and classification)
Quantiles are one of the most practical analytics tools for cartography:
- 5-quantile = quintiles
- 10-quantile = deciles
- compare distribution changes month-to-month
Example: compute quantiles for SPI_3 on a single date.
WITH d AS (
SELECT spi_3
FROM drought
WHERE date = DATE '2024-07-01'
AND spi_3 IS NOT NULL
)
SELECT
quantile_cont(spi_3, 0.10) AS q10,
quantile_cont(spi_3, 0.25) AS q25,
quantile_cont(spi_3, 0.50) AS q50,
quantile_cont(spi_3, 0.75) AS q75,
quantile_cont(spi_3, 0.90) AS q90
FROM d;
Pattern: assign quantile bins for mapping
Compute thresholds once, then label each feature.
WITH thresholds AS (
SELECT
quantile_cont(spi_3, 0.20) AS q20,
quantile_cont(spi_3, 0.40) AS q40,
quantile_cont(spi_3, 0.60) AS q60,
quantile_cont(spi_3, 0.80) AS q80
FROM drought
WHERE date = DATE '2024-07-01'
AND spi_3 IS NOT NULL
)
SELECT
d.*,
CASE
WHEN d.spi_3 <= t.q20 THEN 1
WHEN d.spi_3 <= t.q40 THEN 2
WHEN d.spi_3 <= t.q60 THEN 3
WHEN d.spi_3 <= t.q80 THEN 4
ELSE 5
END AS quintile
FROM drought d, thresholds t
WHERE d.date = DATE '2024-07-01';
Pattern 7: Time windows (rolling averages, seasonal context)
Time windows are essential for climate analytics and dashboards.
Assume:
dateis a DATE- you have one row per geometry per date (or per month)
Rolling mean per feature (e.g., 3-month rolling average)
SELECT
h3_index,
date,
AVG(spi_3) OVER (
PARTITION BY h3_index
ORDER BY date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS spi3_roll3
FROM drought
WHERE spi_3 IS NOT NULL;
Rolling min/max (extremes)
SELECT
h3_index,
date,
MIN(spi_3) OVER (
PARTITION BY h3_index
ORDER BY date
ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
) AS spi3_min12,
MAX(spi_3) OVER (
PARTITION BY h3_index
ORDER BY date
ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
) AS spi3_max12
FROM drought
WHERE spi_3 IS NOT NULL;
Monthly aggregation (if your data is daily)
SELECT
DATE_TRUNC('month', date) AS month,
h3_index,
AVG(spi_3) AS spi3_month_avg
FROM drought
GROUP BY month, h3_index
ORDER BY month, h3_index;
Pattern 8: “Latest available date” and trailing windows
In operational apps, you often want “latest month” without hardcoding it.
WITH latest AS (
SELECT MAX(date) AS max_date
FROM drought
)
SELECT *
FROM drought
WHERE date = (SELECT max_date FROM latest);
Trailing 12 months from latest:
WITH latest AS (
SELECT MAX(date) AS max_date
FROM drought
)
SELECT *
FROM drought
WHERE date > (SELECT max_date - INTERVAL 12 MONTH FROM latest)
AND date <= (SELECT max_date FROM latest);
Pattern 9: Combine clip + time windows (region time series)
This is a common dashboard endpoint: “SPI time series for an admin region”.
WITH boundary AS (
SELECT adm1_id, name, geom
FROM admin
WHERE adm1_id = 34
LIMIT 1
),
inter AS (
SELECT
d.date,
ST_Area(ST_Intersection(d.geom, b.geom)) AS inter_area,
d.spi_3
FROM drought d, boundary b
WHERE ST_Intersects(d.geom, b.geom)
AND d.spi_3 IS NOT NULL
),
region_daily AS (
SELECT
date,
SUM(spi_3 * inter_area) / NULLIF(SUM(inter_area), 0) AS spi3_region
FROM inter
GROUP BY date
)
SELECT
date,
spi3_region,
AVG(spi3_region) OVER (
ORDER BY date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS spi3_region_roll3
FROM region_daily
ORDER BY date;
Performance tips that actually move the needle
- Precompute boundary geometry once (CTE is fine; materialized table/view is better if reused).
- Use bbox columns (
xmin/ymin/xmax/ymax) when possible to avoid expensive geometry operations early. - Avoid running
ST_Intersectionunless you truly need clipped geometry or area-weighted stats. - For repeated dashboards: build a lookup table between grid cells and admin IDs (cell → region mapping).
- Partition Parquet by time (year/month) so
WHERE date BETWEEN ...reads less data.