Spatial Analysis with DuckDB: Spatial Extension Basics, Geometry, CRS, and “How DuckDB Thinks”
DuckDB is not a GIS server. It’s an analytics database that runs embedded (CLI, Python, Node, WASM) and is extremely good at scanning columnar data (Parquet) and computing aggregations. With the spatial extension enabled, DuckDB becomes a powerful “SQL workbench” for geospatial work: you can ingest common GIS formats, run ST_ functions, and export analytics-friendly datasets.
This first post is the foundation for a full series. The goal is to help you build the right mental model so you don’t fall into the classic traps:
- silently wrong CRS
- inverted axis order assumptions
- invalid geometries breaking downstream ops
- huge slow spatial joins that should have been prefiltered
How DuckDB thinks about spatial data
It helps to treat DuckDB like this:
- A columnar analytics engine that excels at “scan → filter → aggregate”.
- Spatial functions are just functions applied to a GEOMETRY column.
- There is no spatial index unless you create one (recent versions of the extension offer an optional R-tree index, but nothing is built automatically); day-to-day performance comes from the following, sketched in the join example right after this list:
- reducing what you scan (partitioning, column pruning),
- prefiltering candidates (bbox filters),
- precomputing reusable relationships (lookup tables).
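For example, a spatial join that prefilters on bounding boxes before running the exact predicate might look like the sketch below. It assumes hypothetical tables points(id, geom) and admin(name, geom) where admin already has precomputed xmin/ymin/xmax/ymax columns (the publish step later in this post shows how to add them):
SELECT p.id, a.name
FROM points p
JOIN admin a
ON ST_X(p.geom) BETWEEN a.xmin AND a.xmax   -- cheap bbox prefilter
AND ST_Y(p.geom) BETWEEN a.ymin AND a.ymax
AND ST_Intersects(a.geom, p.geom);          -- exact test only on surviving pairs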
This mindset changes how you design spatial workflows:
- Instead of “store everything in a DB and query live”, you often:
- publish Parquet/GeoParquet as a data product,
- run repeatable SQL scripts to build derived datasets,
- cache/partition for predictable performance.
What the spatial extension adds
When you enable DuckDB’s spatial extension, you effectively unlock three capabilities:
1) A GEOMETRY type + spatial functions
You get a GEOMETRY type and a large set of ST_ functions, including:
- validation: ST_IsValid, ST_MakeValid, ST_IsEmpty
- measurements: ST_Area, ST_Length
- transforms: ST_Transform (unlike PostGIS, DuckDB geometries don't carry an SRID, so there is no per-row ST_SRID/ST_SetSRID; you pass the source and target CRS to ST_Transform directly)
- relationships: ST_Intersects, ST_Within, ST_Contains
- utility: ST_GeometryType, ST_Envelope, ST_Centroid
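A quick way to confirm the functions are actually available in your session is to evaluate one on a literal geometry:
-- should return 1.0 once the spatial extension is loaded
SELECT ST_Area(ST_GeomFromText('POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))')) AS unit_square_area;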
2) GDAL IO (read common GIS formats)
Most modern workflows need to read files like:
- GeoPackage
- Shapefile
- GeoJSON
- many other OGR-supported formats
The spatial extension provides GDAL-backed IO so you can ingest data without a separate ETL tool.
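If you're unsure whether your build can read a particular format, the extension exposes the GDAL driver list (the exact output columns may differ between versions):
-- list the GDAL drivers available to this build of the spatial extension
SELECT * FROM ST_Drivers();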
3) A clean bridge to Parquet / GeoParquet
DuckDB is excellent at producing analytics outputs:
- Parquet for fast scan-based queries
- GeoParquet for interoperable geometry + metadata (when supported by your tooling)
Even when GeoParquet metadata writing differs by version/tool, DuckDB is still a great “build step” to produce stable, well-typed Parquet outputs.
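One low-friction pattern, assuming your downstream tools prefer plain WKB over GeoParquet-specific metadata, is to export the geometry as WKB bytes next to the attribute columns. The projected table and output path here are just placeholders matching the workflow later in this post:
-- attributes plus WKB geometry; GeoPandas, Spark, etc. can rebuild geometries from the bytes
COPY (
SELECT * EXCLUDE (geom), ST_AsWKB(geom) AS geometry
FROM projected
) TO 'data/output/admin_adm1_wkb.parquet' (FORMAT PARQUET);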
Setup: install and load spatial
In DuckDB, extensions are explicit. In an ETL or analysis script, start with:
INSTALL spatial;
LOAD spatial;
If your environment blocks INSTALL, ensure the extension is already available and only run:
LOAD spatial;
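Either way, you can confirm what your session actually has:
-- check install/load status of the spatial extension
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'spatial';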
Minimal dataset mental model: rows, columns, geometry
In DuckDB, a spatial dataset is just a table with:
- attributes: id, name, date, value, …
- a geometry column: commonly named geom or geometry
A typical shape for analytics:
- one feature per row
- geometry is stored in the extension's own binary encoding, convertible to and from WKB/WKT (you just treat it as GEOMETRY)
- CRS meaning comes from source metadata and, critically, your pipeline discipline (the geometry itself carries no SRID)
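Before running any of the checks below, take a ten-second look at the schema and a few rows (assuming the data was read into a table called src, as in the rest of this post):
DESCRIBE src;
SELECT COUNT(*) AS n_rows FROM src;
SELECT * FROM src LIMIT 5;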
The “don’t get fooled” checks (do these every time)
Most geospatial bugs are not “errors”. They’re silent mismatches that produce plausible outputs.
This section is a checklist you can run immediately after reading any dataset.
Check 1: geometry type distribution
You want to know what you’re working with:
SELECT
ST_GeometryType(geom) AS geom_type,
COUNT(*) AS n
FROM src
GROUP BY 1
ORDER BY n DESC;
Why it matters:
- many operations behave differently for POINT vs POLYGON
- ST_MakeValid can change POLYGON → MULTIPOLYGON
- mixed types often indicate a messy source (or a unioned dataset)
Check 2: geometry validity (and empties)
Invalid geometries can break intersections, areas, or even map rendering.
Count invalid:
SELECT COUNT(*) AS invalid_geoms
FROM src
WHERE geom IS NOT NULL AND NOT ST_IsValid(geom);
Count empties:
SELECT COUNT(*) AS empty_geoms
FROM src
WHERE geom IS NULL OR ST_IsEmpty(geom);
If invalid is non-trivial, plan a repair step:
CREATE OR REPLACE TABLE cleaned AS
SELECT
* EXCLUDE (geom),
ST_MakeValid(geom) AS geom
FROM src
WHERE geom IS NOT NULL;
Then remove empties:
CREATE OR REPLACE TABLE cleaned AS
SELECT *
FROM cleaned
WHERE geom IS NOT NULL AND NOT ST_IsEmpty(geom);
Re-check type distribution after repair (it often changes):
SELECT
ST_GeometryType(geom) AS geom_type,
COUNT(*) AS n
FROM cleaned
GROUP BY 1
ORDER BY n DESC;
Check 3: CRS sanity (the SRID question)
In PostGIS you would ask each geometry for its SRID, but DuckDB's GEOMETRY type does not store one: the CRS lives in the source file's metadata and in your pipeline discipline. That makes silent CRS mismatches even easier, so establish the declared CRS up front.
Inspect what the source claims via the GDAL-backed metadata reader (the exact output shape may vary by extension version):
SELECT *
FROM ST_Read_Meta('data/input/admin.gpkg');
Common realities:
- the source declares no CRS at all (you only know it from documentation)
- different files or layers in the same delivery use different CRSs (bad sign)
- the metadata claims EPSG:4326 but the coordinates look like meters (also bad; the bbox check below catches this)
Because geometries carry no SRID, reprojection needs both the source and the target CRS spelled out. If the source CRS is missing from the metadata, this is exactly where your outside knowledge enters the pipeline:
CREATE OR REPLACE TABLE projected AS
SELECT
* EXCLUDE (geom),
ST_Transform(geom, 'EPSG:27700', 'EPSG:4326') AS geom  -- source CRS first, then target
FROM src;
Depending on the extension version and the CRS definitions involved, you may also need ST_Transform's always_xy option to force longitude/latitude (x, y) output order; spot-check a few known points after transforming.
Check 4: bbox sanity (fast CRS smell test)
Bounding boxes are the fastest way to detect CRS mistakes.
Compute extents:
SELECT
MIN(ST_XMin(geom)) AS xmin,
MIN(ST_YMin(geom)) AS ymin,
MAX(ST_XMax(geom)) AS xmax,
MAX(ST_YMax(geom)) AS ymax
FROM src
WHERE geom IS NOT NULL;
Interpretation heuristics:
- If you believe the data is EPSG:4326:
- longitude should roughly be within [-180, 180]
- latitude should roughly be within [-90, 90]
- If values are in the hundreds of thousands or millions, you’re likely in a projected CRS (meters), not degrees.
This single query catches a huge fraction of “my map is blank” issues.
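If you want the heuristic as a query rather than an eyeball check, here is a minimal sketch (only meaningful if the data is supposed to be in EPSG:4326):
-- count rows whose envelope falls outside plausible lon/lat ranges
SELECT COUNT(*) AS suspicious_rows
FROM src
WHERE geom IS NOT NULL
AND (ST_XMin(geom) < -180 OR ST_XMax(geom) > 180
OR ST_YMin(geom) < -90 OR ST_YMax(geom) > 90);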
Check 5: unit awareness (degrees vs meters)
This one causes subtle errors in analytics:
- ST_Area and ST_Length are unit-dependent
- in EPSG:4326 the results come back in degrees and square degrees, which are not meaningful as physical length or area
- for area/length analytics, you generally want a projected CRS suitable for your region
A practical rule:
- keep a “portable” dataset in EPSG:4326 for interchange
- do measurement-heavy computations in a projected CRS, then store results as numeric columns
Example pattern:
-- Project to a meter-based CRS for measurements (example CRS only;
-- assumes the source data is in EPSG:4326)
CREATE OR REPLACE TABLE measured AS
SELECT
*,
ST_Area(ST_Transform(geom, 'EPSG:4326', 'EPSG:3857')) AS area_m2_approx
FROM src;
(Web Mercator distorts areas badly away from the equator; for accurate measurements, prefer an equal-area CRS appropriate for your geography, as sketched below.)
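As one possible sketch, EPSG:6933 (a global equal-area projection) is a reasonable default when you don't have a regional CRS in mind; treat the choice as an assumption to revisit, and note this again assumes the source is EPSG:4326:
-- equal-area projection gives physically meaningful square meters
CREATE OR REPLACE TABLE measured_ea AS
SELECT
*,
ST_Area(ST_Transform(geom, 'EPSG:4326', 'EPSG:6933')) AS area_m2
FROM src;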
A repeatable workflow: read → validate → transform → publish
This is the mental model that scales.
Read
Use GDAL-backed reading for GIS formats, or read from Parquet.
CREATE OR REPLACE TABLE src AS
SELECT *
FROM ST_Read('data/input/admin.gpkg', layer='adm1');
Or:
CREATE OR REPLACE TABLE src AS
SELECT *
FROM read_parquet('data/input/admin_adm1.parquet');
Validate
Fix invalid and remove empties:
CREATE OR REPLACE TABLE cleaned AS
SELECT
* EXCLUDE (geom),
ST_MakeValid(geom) AS geom
FROM src
WHERE geom IS NOT NULL;
CREATE OR REPLACE TABLE cleaned AS
SELECT *
FROM cleaned
WHERE geom IS NOT NULL AND NOT ST_IsEmpty(geom);
Transform
Reproject from the source CRS to your publishing CRS; both must be given explicitly (the source CRS here is an example, replace it with whatever your metadata or documentation says):
CREATE OR REPLACE TABLE projected AS
SELECT
* EXCLUDE (geom),
ST_Transform(geom, 'EPSG:27700', 'EPSG:4326') AS geom
FROM cleaned;
Publish
Export as Parquet (and optionally GeoParquet depending on your toolchain):
COPY projected
TO 'data/output/admin_adm1.parquet'
(FORMAT PARQUET);
Optional: add bbox columns before publishing (useful for prefilters), then export that table instead:
CREATE OR REPLACE TABLE published AS
SELECT
*,
ST_XMin(geom) AS xmin,
ST_YMin(geom) AS ymin,
ST_XMax(geom) AS xmax,
ST_YMax(geom) AS ymax
FROM projected;
COPY published
TO 'data/output/admin_adm1.parquet'
(FORMAT PARQUET);
Minimal reproducible project layout
A simple structure that scales from laptop experiments to CI builds:
geodata/
input/
admin.gpkg
roads.shp
sql/
01_read.sql
02_clean.sql
03_project.sql
04_publish.sql
output/
admin_adm1.parquet
duckdb/
pipeline.duckdb
Recommended practice:
- keep SQL scripts small and single-purpose
- publish versioned outputs (e.g., a folder like output/v2026_01_27/)
- always validate outputs by re-reading them (trust but verify; see the sketch below)
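The "trust but verify" step can be as small as re-reading the published file and confirming schema and row count:
SELECT COUNT(*) AS n_rows
FROM read_parquet('data/output/admin_adm1.parquet');
DESCRIBE SELECT * FROM read_parquet('data/output/admin_adm1.parquet');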
A “first 10 minutes” checklist for any dataset
When you open a new dataset, run these in order:
- Load spatial extension
- Read into src
- Row count and sample rows
- Geometry type distribution
- Invalid and empty geometry counts
- Source CRS check (from the file's metadata, not from the geometry)
- Bbox extents and sanity check
- Decide publishing CRS (portable) and measurement CRS (analytics)
- Clean + project
- Export Parquet and re-read to confirm schema
A compact set of queries you can copy-paste:
-- types
SELECT ST_GeometryType(geom), COUNT(*) FROM src GROUP BY 1;
-- validity
SELECT COUNT(*) FROM src WHERE geom IS NOT NULL AND NOT ST_IsValid(geom);
-- source CRS (DuckDB geometries don't carry an SRID; check the file's metadata)
SELECT * FROM ST_Read_Meta('data/input/admin.gpkg');
-- extent
SELECT
MIN(ST_XMin(geom)), MIN(ST_YMin(geom)),
MAX(ST_XMax(geom)), MAX(ST_YMax(geom))
FROM src
WHERE geom IS NOT NULL;
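-- rows and sample (first items on the checklist above)
SELECT COUNT(*) FROM src;
SELECT * FROM src LIMIT 5;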