Spatial Analysis with DuckDB: Spatial Extension Basics, Geometry, CRS, and “How DuckDB Thinks”
DuckDB is not a GIS server. It’s an analytics database that runs embedded (CLI, Python, Node, WASM) and is extremely good at scanning columnar data (Parquet) and computing aggregations. With the spatial extension enabled, DuckDB becomes a powerful “SQL workbench” for geospatial work: you can ingest common GIS formats, run ST_ functions, and export analytics-friendly datasets.
This first post is the foundation for a full series. The goal is to help you build the right mental model so you don’t fall into the classic traps:
- silently wrong CRS
- inverted axis order assumptions
- invalid geometries breaking downstream ops
- huge slow spatial joins that should have been prefiltered
How DuckDB thinks about spatial data
It helps to treat DuckDB like this:
- A columnar analytics engine that excels at “scan → filter → aggregate”.
- Spatial functions are just functions applied to a GEOMETRY column.
- There is no spatial index unless you create one (recent versions of the extension offer an optional R-tree index, but nothing is built automatically); day-to-day performance comes from the following, sketched in the join example right after this list:
- reducing what you scan (partitioning, column pruning),
- prefiltering candidates (bbox filters),
- precomputing reusable relationships (lookup tables).
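For example, a spatial join that prefilters on bounding boxes before running the exact predicate might look like the sketch below. It assumes hypothetical tables points(id, geom) and admin(name, geom) where admin already has precomputed xmin/ymin/xmax/ymax columns (the publish step later in this post shows how to add them):
SELECT p.id, a.name
FROM points p
JOIN admin a
ON ST_X(p.geom) BETWEEN a.xmin AND a.xmax   -- cheap bbox prefilter
AND ST_Y(p.geom) BETWEEN a.ymin AND a.ymax
AND ST_Intersects(a.geom, p.geom);          -- exact test only on surviving pairs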
This mindset changes how you design spatial workflows:
- Instead of “store everything in a DB and query live”, you often:
- publish Parquet/GeoParquet as a data product,
- run repeatable SQL scripts to build derived datasets,
- cache/partition for predictable performance.
What the spatial extension adds
When you enable DuckDB’s spatial extension, you effectively unlock three capabilities:
1) A GEOMETRY type + spatial functions
You get a GEOMETRY type and a large set of ST_ functions, including:
- validation: ST_IsValid, ST_MakeValid, ST_IsEmpty
- measurements: ST_Area, ST_Length
- transforms: ST_Transform (unlike PostGIS, DuckDB geometries don't carry an SRID, so there is no per-row ST_SRID/ST_SetSRID; you pass the source and target CRS to ST_Transform directly)
- relationships: ST_Intersects, ST_Within, ST_Contains
- utility: ST_GeometryType, ST_Envelope, ST_Centroid
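A quick way to confirm the functions are actually available in your session is to evaluate one on a literal geometry:
-- should return 1.0 once the spatial extension is loaded
SELECT ST_Area(ST_GeomFromText('POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))')) AS unit_square_area;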
2) GDAL IO (read common GIS formats)
Most modern workflows need to read files like:
- GeoPackage
- Shapefile
- GeoJSON
- many other OGR-supported formats
The spatial extension provides GDAL-backed IO so you can ingest data without a separate ETL tool.
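If you're unsure whether your build can read a particular format, the extension exposes the GDAL driver list (the exact output columns may differ between versions):
-- list the GDAL drivers available to this build of the spatial extension
SELECT * FROM ST_Drivers();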
3) A clean bridge to Parquet / GeoParquet
DuckDB is excellent at producing analytics outputs:
- Parquet for fast scan-based queries
- GeoParquet for interoperable geometry + metadata (when supported by your tooling)
Even when GeoParquet metadata writing differs by version/tool, DuckDB is still a great “build step” to produce stable, well-typed Parquet outputs.
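One low-friction pattern, assuming your downstream tools prefer plain WKB over GeoParquet-specific metadata, is to export the geometry as WKB bytes next to the attribute columns. The projected table and output path here are just placeholders matching the workflow later in this post:
-- attributes plus WKB geometry; GeoPandas, Spark, etc. can rebuild geometries from the bytes
COPY (
SELECT * EXCLUDE (geom), ST_AsWKB(geom) AS geometry
FROM projected
) TO 'data/output/admin_adm1_wkb.parquet' (FORMAT PARQUET);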
Setup: install and load spatial
In DuckDB, extensions are explicit. In an ETL or analysis script, start with:
INSTALL spatial;
LOAD spatial;
If your environment blocks INSTALL, ensure the extension is already available and only run:
LOAD spatial;
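Either way, you can confirm what your session actually has:
-- check install/load status of the spatial extension
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'spatial';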
Minimal dataset mental model: rows, columns, geometry
In DuckDB, a spatial dataset is just a table with:
- attributes: id, name, date, value, …
- a geometry column: commonly named geom or geometry
A typical shape for analytics:
- one feature per row
- geometry is stored in the extension's own binary encoding, convertible to and from WKB/WKT (you just treat it as GEOMETRY)
- CRS meaning comes from source metadata and, critically, your pipeline discipline (the geometry itself carries no SRID)
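Before running any of the checks below, take a ten-second look at the schema and a few rows (assuming the data was read into a table called src, as in the rest of this post):
DESCRIBE src;
SELECT COUNT(*) AS n_rows FROM src;
SELECT * FROM src LIMIT 5;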
The “don’t get fooled” checks (do these every time)
Most geospatial bugs are not “errors”. They’re silent mismatches that produce plausible outputs.
This section is a checklist you can run immediately after reading any dataset.
Check 1: geometry type distribution
You want to know what you’re working with:
SELECT
ST_GeometryType(geom) AS geom_type,
COUNT(*) AS n
FROM src
GROUP BY 1
ORDER BY n DESC;
Why it matters:
- many operations behave differently for POINT vs POLYGON
- ST_MakeValid can change POLYGON → MULTIPOLYGON
- mixed types often indicate a messy source (or a unioned dataset)
Check 2: geometry validity (and empties)
Invalid geometries can break intersections, areas, or even map rendering.
Count invalid:
SELECT COUNT(*) AS invalid_geoms
FROM src
WHERE geom IS NOT NULL AND NOT ST_IsValid(geom);
Count empties:
SELECT COUNT(*) AS empty_geoms
FROM src
WHERE geom IS NULL OR ST_IsEmpty(geom);
If invalid is non-trivial, plan a repair step:
CREATE OR REPLACE TABLE cleaned AS
SELECT
* EXCLUDE (geom),
ST_MakeValid(geom) AS geom
FROM src
WHERE geom IS NOT NULL;
Then remove empties:
CREATE OR REPLACE TABLE cleaned AS
SELECT *
FROM cleaned
WHERE geom IS NOT NULL AND NOT ST_IsEmpty(geom);
Re-check type distribution after repair (it often changes):
SELECT
ST_GeometryType(geom) AS geom_type,
COUNT(*) AS n
FROM cleaned
GROUP BY 1
ORDER BY n DESC;
Check 3: CRS sanity (the SRID question)
In PostGIS you would ask each geometry for its SRID, but DuckDB's GEOMETRY type does not store one: the CRS lives in the source file's metadata and in your pipeline discipline. That makes silent CRS mismatches even easier, so establish the declared CRS up front.
Inspect what the source claims via the GDAL-backed metadata reader (the exact output shape may vary by extension version):
SELECT *
FROM ST_Read_Meta('data/input/admin.gpkg');
Common realities:
- the source declares no CRS at all (you only know it from documentation)
- different files or layers in the same delivery use different CRSs (bad sign)
- the metadata claims EPSG:4326 but the coordinates look like meters (also bad; the bbox check below catches this)
Because geometries carry no SRID, reprojection needs both the source and the target CRS spelled out. If the source CRS is missing from the metadata, this is exactly where your outside knowledge enters the pipeline:
CREATE OR REPLACE TABLE projected AS
SELECT
* EXCLUDE (geom),
ST_Transform(geom, 'EPSG:27700', 'EPSG:4326') AS geom  -- source CRS first, then target
FROM src;
Depending on the extension version and the CRS definitions involved, you may also need ST_Transform's always_xy option to force longitude/latitude (x, y) output order; spot-check a few known points after transforming.
Check 4: bbox sanity (fast CRS smell test)
Bounding boxes are the fastest way to detect CRS mistakes.
Compute extents:
SELECT
MIN(ST_XMin(geom)) AS xmin,
MIN(ST_YMin(geom)) AS ymin,
MAX(ST_XMax(geom)) AS xmax,
MAX(ST_YMax(geom)) AS ymax
FROM src
WHERE geom IS NOT NULL;
Interpretation heuristics:
- If you believe the data is EPSG:4326:
- longitude should roughly be within [-180, 180]
- latitude should roughly be within [-90, 90]
- If values are in the hundreds of thousands or millions, you’re likely in a projected CRS (meters), not degrees.
This single query catches a huge fraction of “my map is blank” issues.
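If you want the heuristic as a query rather than an eyeball check, here is a minimal sketch (only meaningful if the data is supposed to be in EPSG:4326):
-- count rows whose envelope falls outside plausible lon/lat ranges
SELECT COUNT(*) AS suspicious_rows
FROM src
WHERE geom IS NOT NULL
AND (ST_XMin(geom) < -180 OR ST_XMax(geom) > 180
OR ST_YMin(geom) < -90 OR ST_YMax(geom) > 90);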
Check 5: unit awareness (degrees vs meters)
This one causes subtle errors in analytics:
- ST_Area and ST_Length are unit-dependent
- in EPSG:4326 the results come back in degrees and square degrees, which are not meaningful as physical length or area
- for area/length analytics, you generally want a projected CRS suitable for your region
A practical rule:
- keep a “portable” dataset in EPSG:4326 for interchange
- do measurement-heavy computations in a projected CRS, then store results as numeric columns
Example pattern:
-- Project to a meter-based CRS for measurements (example CRS only;
-- assumes the source data is in EPSG:4326)
CREATE OR REPLACE TABLE measured AS
SELECT
*,
ST_Area(ST_Transform(geom, 'EPSG:4326', 'EPSG:3857')) AS area_m2_approx
FROM src;
(Web Mercator distorts areas badly away from the equator; for accurate measurements, prefer an equal-area CRS appropriate for your geography, as sketched below.)
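As one possible sketch, EPSG:6933 (a global equal-area projection) is a reasonable default when you don't have a regional CRS in mind; treat the choice as an assumption to revisit, and note this again assumes the source is EPSG:4326:
-- equal-area projection gives physically meaningful square meters
CREATE OR REPLACE TABLE measured_ea AS
SELECT
*,
ST_Area(ST_Transform(geom, 'EPSG:4326', 'EPSG:6933')) AS area_m2
FROM src;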
A repeatable workflow: read → validate → transform → publish
This is the mental model that scales.
Read
Use GDAL-backed reading for GIS formats, or read from Parquet.
CREATE OR REPLACE TABLE src AS
SELECT *
FROM ST_Read('data/input/admin.gpkg', layer='adm1');
Or:
CREATE OR REPLACE TABLE src AS
SELECT *
FROM read_parquet('data/input/admin_adm1.parquet');
Validate
Fix invalid and remove empties:
CREATE OR REPLACE TABLE cleaned AS
SELECT
* EXCLUDE (geom),
ST_MakeValid(geom) AS geom
FROM src
WHERE geom IS NOT NULL;
CREATE OR REPLACE TABLE cleaned AS
SELECT *
FROM cleaned
WHERE geom IS NOT NULL AND NOT ST_IsEmpty(geom);
Transform
Reproject from the source CRS to your publishing CRS; both must be given explicitly (the source CRS here is an example, replace it with whatever your metadata or documentation says):
CREATE OR REPLACE TABLE projected AS
SELECT
* EXCLUDE (geom),
ST_Transform(geom, 'EPSG:27700', 'EPSG:4326') AS geom
FROM cleaned;
Publish
Export as Parquet (and optionally GeoParquet depending on your toolchain):
COPY projected
TO 'data/output/admin_adm1.parquet'
(FORMAT PARQUET);
Optional: add bbox columns before publishing (useful for prefilters), then export that table instead:
CREATE OR REPLACE TABLE published AS
SELECT
*,
ST_XMin(geom) AS xmin,
ST_YMin(geom) AS ymin,
ST_XMax(geom) AS xmax,
ST_YMax(geom) AS ymax
FROM projected;
COPY published
TO 'data/output/admin_adm1.parquet'
(FORMAT PARQUET);
Minimal reproducible project layout
A simple structure that scales from laptop experiments to CI builds:
geodata/
input/
admin.gpkg
roads.shp
sql/
01_read.sql
02_clean.sql
03_project.sql
04_publish.sql
output/
admin_adm1.parquet
duckdb/
pipeline.duckdb
Recommended practice:
- keep SQL scripts small and single-purpose
- publish versioned outputs (e.g., a folder like output/v2026_01_27/)
- always validate outputs by re-reading them (trust but verify; see the sketch below)
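The "trust but verify" step can be as small as re-reading the published file and confirming schema and row count:
SELECT COUNT(*) AS n_rows
FROM read_parquet('data/output/admin_adm1.parquet');
DESCRIBE SELECT * FROM read_parquet('data/output/admin_adm1.parquet');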
A “first 10 minutes” checklist for any dataset
When you open a new dataset, run these in order:
- Load spatial extension
- Read into src
- Row count and sample rows
- Geometry type distribution
- Invalid and empty geometry counts
- Source CRS check (from the file's metadata, not from the geometry)
- Bbox extents and sanity check
- Decide publishing CRS (portable) and measurement CRS (analytics)
- Clean + project
- Export Parquet and re-read to confirm schema
A compact set of queries you can copy-paste:
-- types
SELECT ST_GeometryType(geom), COUNT(*) FROM src GROUP BY 1;
-- validity
SELECT COUNT(*) FROM src WHERE geom IS NOT NULL AND NOT ST_IsValid(geom);
-- source CRS (DuckDB geometries don't carry an SRID; check the file's metadata)
SELECT * FROM ST_Read_Meta('data/input/admin.gpkg');
-- extent
SELECT
MIN(ST_XMin(geom)), MIN(ST_YMin(geom)),
MAX(ST_XMax(geom)), MAX(ST_YMax(geom))
FROM src
WHERE geom IS NOT NULL;
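-- rows and sample (first items on the checklist above)
SELECT COUNT(*) FROM src;
SELECT * FROM src LIMIT 5;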