DuckDB Spatial as an ETL Engine: GeoPackage/Shapefile → Parquet/GeoParquet
Geospatial ETL often starts simple (download a GeoPackage/Shapefile) and ends up heavy (load into PostGIS/QGIS, fix geometries, reproject, export, repeat). If your end goal is an analytics-ready format—especially Parquet or GeoParquet—you can do a huge amount of work with DuckDB + duckdb-spatial directly, using SQL.
This post focuses on a practical “data build” approach:
- Read GeoPackage and Shapefile via DuckDB Spatial (GDAL-backed)
- Validate/fix geometries
- Reproject
- Normalize attributes and types
- Export to Parquet (and optionally GeoParquet)
Why DuckDB Spatial works well for ETL
DuckDB runs embedded (CLI/Python/Node/WASM) and the spatial extension adds:
- ST_* functions (validation, transforms, bounding boxes, etc.)
- GDAL-backed import/export for many formats (GeoPackage, Shapefile, GeoJSON, …)
- Columnar outputs (Parquet) for analytics and cloud-native delivery
- Repeatability: one SQL script = one deterministic build
Install and load the Spatial extension
In DuckDB, extensions are explicit. You typically do this once at the top of your ETL script:
INSTALL spatial;
LOAD spatial;
In locked-down environments you might not be allowed to INSTALL extensions. In that case, ensure the extension is already available on the machine and just run LOAD spatial;.
Read a GeoPackage (GPKG)
Read a specific layer into a table:
CREATE OR REPLACE TABLE src AS
SELECT *
FROM ST_Read('data/input/admin.gpkg', layer='adm1');
Quick sanity checks:
DESCRIBE src;
SELECT COUNT(*) FROM src;
SELECT * FROM src LIMIT 5;
Read a Shapefile
Shapefiles are multi-file datasets (.shp, .dbf, .prj, …). DuckDB reads the .shp path:
CREATE OR REPLACE TABLE src AS
SELECT *
FROM ST_Read('data/input/roads.shp');
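Older Shapefiles often store attributes in a non-UTF-8 codepage, and mangled text is easier to fix at read time than after export. As a sketch (assuming a Latin-1 encoded .dbf), GDAL open options can be passed through ST_Read:
-- Pass GDAL open options through ST_Read; ENCODING is the
-- Shapefile driver's open option for the .dbf codepage.
CREATE OR REPLACE TABLE src AS
SELECT *
FROM ST_Read('data/input/roads.shp', open_options=['ENCODING=ISO-8859-1']);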
Geometry checks: type, SRID, validity
Geometry types distribution
SELECT
ST_GeometryType(geom) AS geom_type,
COUNT(*) AS n
FROM src
GROUP BY 1
ORDER BY n DESC;
CRS handling (no per-geometry SRID)
Unlike PostGIS, DuckDB Spatial's GEOMETRY type does not store an SRID, so there is no per-row SRID to query. The CRS lives in the source file's metadata, and it is up to you to carry it through the pipeline. CRS information can be missing or inconsistent in real-world data, so treat CRS assignment as a high-stakes ETL step.
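To find out what CRS a source file declares, you can inspect its metadata without loading any features. A sketch using ST_Read_Meta (the exact shape of the returned layers struct varies across spatial extension versions):
-- One row per file; the layers struct includes each layer's
-- geometry fields and their declared CRS.
SELECT layers
FROM ST_Read_Meta('data/input/admin.gpkg');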
Count invalid geometries
SELECT COUNT(*) AS invalid_geoms
FROM src
WHERE NOT ST_IsValid(geom);
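Before fixing anything, it can be worth eyeballing a few offenders (this assumes an id column exists in your source; adjust to your schema):
-- Sample a handful of invalid geometries for inspection.
SELECT id, ST_GeometryType(geom) AS geom_type
FROM src
WHERE NOT ST_IsValid(geom)
LIMIT 10;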
Fix invalid geometries (and drop empties)
A common pipeline step is to ST_MakeValid, then drop NULL/empty geometries. A single statement with a CTE keeps the build deterministic and avoids re-creating the table from itself:
CREATE OR REPLACE TABLE cleaned AS
WITH fixed AS (
  SELECT
    * EXCLUDE (geom),
    ST_MakeValid(geom) AS geom
  FROM src
  WHERE geom IS NOT NULL
)
SELECT *
FROM fixed
WHERE geom IS NOT NULL AND NOT ST_IsEmpty(geom);
Re-check types after fixing (validity fixes can produce MULTI* or collections):
SELECT
ST_GeometryType(geom) AS geom_type,
COUNT(*) AS n
FROM cleaned
GROUP BY 1
ORDER BY n DESC;
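If collections show up and you want a homogeneous polygon layer, one option is ST_CollectionExtract. A sketch, assuming the PostGIS-style convention where type code 3 selects polygons (cleaned_polys is just an illustrative table name):
-- Keep only the polygonal parts of each geometry
-- (1 = points, 2 = lines, 3 = polygons).
CREATE OR REPLACE TABLE cleaned_polys AS
SELECT
  * EXCLUDE (geom),
  ST_CollectionExtract(geom, 3) AS geom
FROM cleaned;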
Reproject to a standard CRS (WGS84)
Publishing in EPSG:4326 is usually the most portable choice for data products.
Because DuckDB geometries do not carry an SRID, ST_Transform takes the source and the target CRS explicitly. If you know the source CRS (this example assumes British National Grid, EPSG:27700):
CREATE OR REPLACE TABLE projected AS
SELECT
  * EXCLUDE (geom),
  ST_Transform(geom, 'EPSG:27700', 'EPSG:4326') AS geom
FROM cleaned;
If the data is already in EPSG:4326, skip this step. If you don't know the source CRS, check the layer metadata first (ST_Read_Meta, shown earlier); a wrong guess silently corrupts every coordinate.
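One gotcha: some CRS definitions (EPSG:4326 included) declare latitude before longitude, so a strict transform can flip axis order relative to what most tools expect. Recent spatial versions let ST_Transform take a final always_xy flag to force longitude/latitude output; treat the exact signature as version-dependent:
-- Force x/y (longitude/latitude) axis order in the output.
SELECT ST_Transform(geom, 'EPSG:27700', 'EPSG:4326', true) AS geom
FROM cleaned
LIMIT 5;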
Normalize attributes: rename, trim, cast types
Parquet benefits from stable, explicit types.
Example “final schema” step:
CREATE OR REPLACE TABLE final AS
SELECT
CAST(id AS BIGINT) AS id,
NULLIF(TRIM(name), '') AS name,
CAST(population AS BIGINT) AS population,
CAST(area_km2 AS DOUBLE) AS area_km2,
geom
FROM projected;
Add helpful bbox columns (optional but useful)
Bounding boxes help downstream filtering and can speed up interactive workflows:
CREATE OR REPLACE TABLE final_indexed AS
SELECT
*,
ST_XMin(geom) AS xmin,
ST_YMin(geom) AS ymin,
ST_XMax(geom) AS xmax,
ST_YMax(geom) AS ymax
FROM final;
Export to Parquet
Single-file export:
COPY final_indexed
TO 'data/output/admin_adm1.parquet'
(FORMAT PARQUET);
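DuckDB's Parquet writer also exposes tuning knobs; for example, ZSTD compression and an explicit row-group size often help large analytic files:
-- ZSTD usually compresses better than the default Snappy;
-- row-group size controls the unit of predicate pushdown.
COPY final_indexed
TO 'data/output/admin_adm1.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);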
Partitioned export (recommended for large datasets; this assumes your schema kept a country_code column):
COPY final_indexed
TO 'data/output/admin_adm1_by_country/'
(FORMAT PARQUET, PARTITION_BY (country_code));
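Downstream, the partitioned output reads back with partition pruning; the country code value here is just an illustration:
-- hive_partitioning exposes country_code as a column; the bbox
-- columns let row-group statistics prune non-matching data.
SELECT COUNT(*)
FROM read_parquet('data/output/admin_adm1_by_country/**/*.parquet',
                  hive_partitioning = true)
WHERE country_code = 'FR'
  AND xmin <= 10.0 AND xmax >= -5.0;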
GeoParquet notes (practical)
“GeoParquet” means Parquet + standardized metadata describing the geometry column and CRS. Tooling support depends on versions.
A robust workflow is:
- Ensure geom is the geometry column name and the CRS is correct (e.g., EPSG:4326).
- Export Parquet as above.
- If your environment supports writing GeoParquet metadata directly, enable that for export.
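One version-dependent route is exporting through GDAL, whose Parquet driver writes GeoParquet metadata when it is compiled into the build. A sketch: check for the driver first, since many builds ship without it:
-- List bundled GDAL drivers; look for a Parquet entry.
SELECT short_name, long_name
FROM ST_Drivers()
WHERE short_name ILIKE '%parquet%';
-- If the driver is present, this export carries GeoParquet metadata:
COPY final_indexed
TO 'data/output/admin_adm1_geo.parquet'
WITH (FORMAT GDAL, DRIVER 'Parquet');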
Recent DuckDB releases can read GeoParquet natively; write-side support is still evolving, so check the documentation for your DuckDB and spatial extension versions for the exact export statement that applies to your setup.
End-to-end ETL: one reproducible SQL script
Save as etl_admin.sql:
INSTALL spatial;
LOAD spatial;
CREATE OR REPLACE TABLE src AS
SELECT *
FROM ST_Read('data/input/admin.gpkg', layer='adm1');
CREATE OR REPLACE TABLE cleaned AS
WITH fixed AS (
  SELECT
    * EXCLUDE (geom),
    ST_MakeValid(geom) AS geom
  FROM src
  WHERE geom IS NOT NULL
)
SELECT *
FROM fixed
WHERE geom IS NOT NULL AND NOT ST_IsEmpty(geom);
-- Replace 'EPSG:27700' with the source CRS reported by ST_Read_Meta
CREATE OR REPLACE TABLE projected AS
SELECT
  * EXCLUDE (geom),
  ST_Transform(geom, 'EPSG:27700', 'EPSG:4326') AS geom
FROM cleaned;
CREATE OR REPLACE TABLE final AS
SELECT
CAST(id AS BIGINT) AS id,
NULLIF(TRIM(name), '') AS name,
geom
FROM projected;
COPY final TO 'data/output/admin_adm1.parquet' (FORMAT PARQUET);
Run with DuckDB CLI:
duckdb etl.duckdb -c ".read etl_admin.sql"