DuckDB Spatial as an ETL Engine: GeoPackage/Shapefile → Parquet/GeoParquet


Geospatial ETL often starts simple (download a GeoPackage/Shapefile) and ends up heavy (load into PostGIS/QGIS, fix geometries, reproject, export, repeat). If your end goal is an analytics-ready format—especially Parquet or GeoParquet—you can do a huge amount of work with DuckDB + duckdb-spatial directly, using SQL.

This post focuses on a practical “data build” approach:

  • Read GeoPackage and Shapefile via DuckDB Spatial (GDAL-backed)
  • Validate/fix geometries
  • Reproject
  • Normalize attributes and types
  • Export to Parquet (and optionally GeoParquet)

Why DuckDB Spatial works well for ETL

DuckDB runs embedded (CLI/Python/Node/WASM) and the spatial extension adds:

  • ST_ functions (validation, transforms, bbox, etc.)
  • GDAL import/export for many formats (GeoPackage, Shapefile, GeoJSON, …)
  • Columnar outputs (Parquet) for analytics and cloud-native delivery
  • Repeatability: one SQL script = one deterministic build

Install and load the Spatial extension

In DuckDB, extensions are explicit. You typically do this once at the top of your ETL script:

INSTALL spatial;
LOAD spatial;

In locked-down environments you might not be allowed to INSTALL. In that case, ensure the extension is available and just LOAD spatial;.
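For example, if the extension files are staged on disk ahead of time, you can point DuckDB at that directory and load from there. A minimal sketch; the path is hypothetical and the extension_directory setting may differ across DuckDB versions:

-- Point DuckDB at a pre-provisioned extension directory, then load without INSTALL.
SET extension_directory = '/opt/duckdb/extensions';  -- hypothetical path
LOAD spatial;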


Read a GeoPackage (GPKG)

Read a specific layer into a table:

CREATE OR REPLACE TABLE src AS
SELECT *
FROM ST_Read('data/input/admin.gpkg', layer='adm1');

Quick sanity checks:

DESCRIBE src;
SELECT COUNT(*) FROM src;
SELECT * FROM src LIMIT 5;

Read a Shapefile

Shapefiles are multi-file datasets (.shp, .dbf, .prj, …). Point ST_Read at the .shp path and keep the sibling files next to it:

CREATE OR REPLACE TABLE src AS
SELECT *
FROM ST_Read('data/input/roads.shp');
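Shapefile attribute tables (.dbf) sometimes use legacy encodings. ST_Read can pass open options through to the GDAL driver, which gives you a place to handle that; a hedged sketch, assuming the open_options parameter and the Shapefile driver's ENCODING option are available in your versions:

-- Pass GDAL open options through ST_Read (e.g., to fix garbled attribute text).
-- 'ENCODING=ISO-8859-1' is the Shapefile driver's open option; adjust to your data.
CREATE OR REPLACE TABLE src AS
SELECT *
FROM ST_Read('data/input/roads.shp', open_options = ['ENCODING=ISO-8859-1']);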

Geometry checks: type, SRID, validity

Geometry type distribution:

SELECT
  ST_GeometryType(geom) AS geom_type,
  COUNT(*) AS n
FROM src
GROUP BY 1
ORDER BY n DESC;

A note on SRIDs: DuckDB Spatial's GEOMETRY type does not carry a per-geometry SRID, so a PostGIS-style ST_SRID check has nothing to report here. The CRS belongs to the source dataset's layer metadata, and you carry that knowledge through the pipeline yourself. CRS information can be missing or inconsistent in real-world data, so treat CRS identification and assignment as a “high stakes” ETL step.
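To see what the source file itself declares, the spatial extension ships a metadata table function. A minimal sketch, assuming ST_Read_Meta is available in your build (its output layout varies between versions):

-- Inspect file-level metadata: layer names, feature counts, and the declared CRS.
SELECT *
FROM ST_Read_Meta('data/input/admin.gpkg');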

Count invalid geometries:

SELECT COUNT(*) AS invalid_geoms
FROM src
WHERE NOT ST_IsValid(geom);

Fix invalid geometries (and drop empties)

A common pipeline step is to ST_MakeValid and remove empty/null geometries:

CREATE OR REPLACE TABLE cleaned AS
WITH fixed AS (
  SELECT
    * EXCLUDE (geom),
    ST_MakeValid(geom) AS geom
  FROM src
  WHERE geom IS NOT NULL
)
SELECT *
FROM fixed
WHERE NOT ST_IsEmpty(geom);

Re-check types after fixing (validity fixes can produce MULTI* or collections):

SELECT
  ST_GeometryType(geom) AS geom_type,
  COUNT(*) AS n
FROM cleaned
GROUP BY 1
ORDER BY n DESC;
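If the build only wants polygonal features, a type filter after the fix keeps stray points or lines out of the output. A sketch; the cleaned_polygons name and the exact type strings reported by ST_GeometryType are assumptions to adjust for your data:

-- Keep only polygonal features; ST_MakeValid can occasionally emit lower-dimension pieces.
CREATE OR REPLACE TABLE cleaned_polygons AS
SELECT *
FROM cleaned
WHERE CAST(ST_GeometryType(geom) AS VARCHAR) IN ('POLYGON', 'MULTIPOLYGON');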

Reproject to a standard CRS (WGS84)

Publishing in EPSG:4326 is usually the most portable choice for data products.

Because geometries don't carry an SRID, DuckDB Spatial's ST_Transform takes the source and target CRS explicitly rather than reading an SRID from the geometry. You therefore need to know the source CRS either way; this example assumes British National Grid (EPSG:27700):

CREATE OR REPLACE TABLE projected AS
SELECT
  * EXCLUDE (geom),
  ST_Transform(geom, 'EPSG:27700', 'EPSG:4326', always_xy := true) AS geom
FROM cleaned;

The always_xy := true argument requests longitude/latitude (x/y) axis order in the output, which is what most downstream tooling expects; drop it if your spatial extension version doesn't accept it.

Normalize attributes: rename, trim, cast types

Parquet benefits from stable, explicit types.

Example “final schema” step:

CREATE OR REPLACE TABLE final AS
SELECT
  CAST(id AS BIGINT)                 AS id,
  NULLIF(TRIM(name), '')             AS name,
  CAST(population AS BIGINT)         AS population,
  CAST(area_km2 AS DOUBLE)           AS area_km2,
  geom
FROM projected;
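If the source attributes are messy, TRY_CAST is a useful variant: it yields NULL instead of failing the whole build on one bad value.

-- Defensive variant: bad values become NULL instead of aborting the build.
CREATE OR REPLACE TABLE final AS
SELECT
  TRY_CAST(id AS BIGINT)          AS id,
  NULLIF(TRIM(name), '')          AS name,
  TRY_CAST(population AS BIGINT)  AS population,
  TRY_CAST(area_km2 AS DOUBLE)    AS area_km2,
  geom
FROM projected;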

Add helpful bbox columns (optional but useful)

Bounding boxes help downstream filtering and can speed up interactive workflows:

CREATE OR REPLACE TABLE final_indexed AS
SELECT
  *,
  ST_XMin(geom) AS xmin,
  ST_YMin(geom) AS ymin,
  ST_XMax(geom) AS xmax,
  ST_YMax(geom) AS ymax
FROM final;
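Downstream, a consumer can prune on those columns before doing any geometry work; for example (the coordinates are placeholder values):

-- Cheap bbox prefilter on the exported Parquet, no spatial functions required.
SELECT *
FROM read_parquet('data/output/admin_adm1.parquet')
WHERE xmin <= 1.0 AND xmax >= -1.0
  AND ymin <= 52.0 AND ymax >= 50.0;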

Export to Parquet

Single-file export:

COPY final_indexed
TO 'data/output/admin_adm1.parquet'
(FORMAT PARQUET);
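Parquet writer options go in the same option list if you need them; zstd compression is a reasonable default for delivery:

COPY final_indexed
TO 'data/output/admin_adm1.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);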

Partitioned export (recommended for large datasets; this assumes your table has a suitable partition column such as country_code):

COPY final_indexed
TO 'data/output/admin_adm1_by_country/'
(FORMAT PARQUET, PARTITION_BY (country_code));
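Reading the partitioned output back, in DuckDB or anything that understands Hive-style partitioning, looks like:

-- The partition column (country_code) is recovered from the directory names.
SELECT country_code, COUNT(*) AS n
FROM read_parquet('data/output/admin_adm1_by_country/**/*.parquet', hive_partitioning = true)
GROUP BY country_code;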

GeoParquet notes (practical)

“GeoParquet” means Parquet + standardized metadata describing the geometry column and CRS. Tooling support depends on versions.

A robust workflow is:

  1. Ensure geom is the geometry column name and the CRS is correct (e.g., EPSG:4326).
  2. Export Parquet as above.
  3. If your environment supports writing GeoParquet metadata directly, enable that for export; otherwise, see the WKB fallback sketched below.
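
If you can't rely on GeoParquet support in your toolchain, a portable fallback is to write plain Parquet with the geometry serialized as WKB; most geospatial libraries can rebuild geometries from a WKB column. A sketch (the output path and column name are illustrative):

-- Plain Parquet with geometry serialized as WKB for maximum compatibility.
COPY (
  SELECT
    * EXCLUDE (geom),
    ST_AsWKB(geom) AS geometry_wkb
  FROM final_indexed
) TO 'data/output/admin_adm1_wkb.parquet' (FORMAT PARQUET);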

The exact GeoParquet export statement depends on your DuckDB and spatial extension versions (and whether you drive them from the CLI, Python, or Node), so check the spatial extension's documentation for the release you're running.


End-to-end ETL: one reproducible SQL script

Save as etl_admin.sql:

INSTALL spatial;
LOAD spatial;

CREATE OR REPLACE TABLE src AS
SELECT *
FROM ST_Read('data/input/admin.gpkg', layer='adm1');

CREATE OR REPLACE TABLE cleaned AS
WITH fixed AS (
  SELECT
    * EXCLUDE (geom),
    ST_MakeValid(geom) AS geom
  FROM src
  WHERE geom IS NOT NULL
)
SELECT *
FROM fixed
WHERE NOT ST_IsEmpty(geom);

CREATE OR REPLACE TABLE projected AS
SELECT
  * EXCLUDE (geom),
  -- The source CRS must be given explicitly; EPSG:27700 is an example, adjust to your data.
  ST_Transform(geom, 'EPSG:27700', 'EPSG:4326', always_xy := true) AS geom
FROM cleaned;

CREATE OR REPLACE TABLE final AS
SELECT
  CAST(id AS BIGINT)     AS id,
  NULLIF(TRIM(name), '') AS name,
  geom
FROM projected;

COPY final TO 'data/output/admin_adm1.parquet' (FORMAT PARQUET);

Run with DuckDB CLI:

duckdb etl.duckdb -c ".read etl_admin.sql"