GeoParquet 101: Geometry Storage, Bounding Boxes, Metadata, and Interoperability


GeoParquet 101: geometry is just bytes—until metadata makes it spatial

Parquet is a columnar file format built for analytics. It’s fast, compresses well, and is widely supported by modern data tooling.

But vanilla Parquet has no built-in concept of “geometry”.

GeoParquet bridges that gap by standardising:

  • how a geometry column is encoded (typically WKB),
  • what coordinate reference system (CRS) the geometry uses,
  • how to describe the geometry column and its properties in file metadata,
  • optional extra metadata like bounding boxes to speed up common spatial workflows.

If you’ve ever exported geometries to “Parquet” and found that one tool reads it as spatial while another sees “just binary”, GeoParquet is the fix.

This post explains the practical details you need to publish GeoParquet datasets that behave well across engines like DuckDB, GDAL/OGR, GeoPandas, Arrow, and cloud-native stacks.


What is GeoParquet?

GeoParquet is a specification for storing geospatial vector data in Parquet in a tool-agnostic way.

A GeoParquet file is basically:

  • a normal Parquet file with rows and columns, plus
  • a designated geometry column (or columns), plus
  • required metadata (in Parquet key-value metadata) describing the geometry encoding and CRS.

The crucial idea: Parquet stores the data, metadata tells tools how to interpret it.


How geometry is stored in GeoParquet

The “default” encoding: WKB

In most GeoParquet datasets, geometry is stored as WKB (Well-Known Binary):

  • The geometry column is a binary-like type (often Arrow binary).
  • Each row stores one geometry object encoded in WKB.
  • WKB is compact, portable, and widely supported.

So your schema often looks like:

  • id: int64
  • name: string
  • geom: binary ← but “binary” alone doesn’t tell anyone it’s geometry

GeoParquet metadata is what marks geom as a geometry column and specifies encoding, CRS, etc.
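
To make that concrete, here is a minimal DuckDB sketch of the WKB round trip (the point literal is just an illustration; ST_AsWKB and ST_GeomFromWKB come from the spatial extension):

LOAD spatial;

-- A geometry serialised to WKB bytes (what actually sits in the Parquet column),
-- then parsed back into a readable geometry.
SELECT
  ST_AsWKB(ST_GeomFromText('POINT (1 2)')) AS wkb_bytes,
  ST_AsText(ST_GeomFromWKB(
    ST_AsWKB(ST_GeomFromText('POINT (1 2)'))
  )) AS round_trip;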

Why WKB?

Because it’s:

  • supported across GIS tools and libraries,
  • easy to serialize/deserialize,
  • stable for interchange,
  • efficient with Parquet’s compression and encoding.

Alternative encodings

GeoParquet also supports other encodings (tooling-dependent), but WKB is the most interoperable starting point.

If you want maximum compatibility across engines, publish WKB.


Bounding boxes: two different concepts

“Bounding box” can mean two different things in this ecosystem:

1) Dataset-level bounding box (metadata)

A single bbox describing the extent of the entire dataset:

  • xmin, ymin, xmax, ymax

This is useful for:

  • quick map view framing,
  • cataloging,
  • sanity checks (wrong CRS often shows up immediately).

2) Feature-level bounding boxes (columns or computed)

A bbox per geometry row:

  • xmin, ymin, xmax, ymax columns

This is useful for:

  • fast pre-filtering before doing expensive geometry ops,
  • viewport queries,
  • spatial indexing-like workflows in analytical engines.

In practice, many “analytics-first” pipelines add bbox columns explicitly because they are extremely useful even when the engine doesn’t build spatial indexes.

Example derived columns:

SELECT
  *,
  ST_XMin(geom) AS xmin,
  ST_YMin(geom) AS ymin,
  ST_XMax(geom) AS xmax,
  ST_YMax(geom) AS ymax
FROM features;

A common pattern: use bbox filters first, then precise geometry predicates only on the candidates.
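
Here is a sketch of that two-phase pattern, assuming the derived xmin/ymin/xmax/ymax columns above have been materialised and using a made-up viewport around London:

-- Coarse pass: cheap numeric overlap test on the bbox columns
-- Precise pass: ST_Intersects only on the surviving candidates
SELECT *
FROM features
WHERE xmin <= 0.3 AND xmax >= -0.5
  AND ymin <= 51.7 AND ymax >= 51.3
  AND ST_Intersects(
        geom,
        ST_GeomFromText('POLYGON ((-0.5 51.3, 0.3 51.3, 0.3 51.7, -0.5 51.7, -0.5 51.3))')
      );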


GeoParquet metadata: the heart of interoperability

Interoperability comes from metadata.

GeoParquet uses Parquet key-value metadata to define (at least):

  • which column(s) are geometry
  • how they’re encoded (WKB, etc.)
  • CRS definition
  • geometry types (optional but helpful)
  • coordinate axis order and CRS semantics
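
One way to see this for yourself is DuckDB's parquet_kv_metadata table function; GeoParquet stores its JSON under the 'geo' key. A sketch (file path reused from the example later in this post; decode() turns the BLOB key/value pairs into text):

-- Inspect Parquet key-value metadata; the 'geo' key holds the GeoParquet JSON
SELECT decode(key) AS key, decode(value) AS geo_json
FROM parquet_kv_metadata('data/output/admin_adm1.parquet')
WHERE decode(key) = 'geo';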

What happens if metadata is missing?

Then tools see:

  • geom is binary
  • there is no known CRS
  • the file is “just Parquet”

Some tools will still work if you manually tell them:

  • “this binary column is WKB”
  • “assume EPSG:4326”

…but that’s exactly the point: without standard metadata, every consumer has to guess.
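
For example, this is the kind of manual workaround you end up writing in DuckDB for a metadata-less file (a sketch: the file name is hypothetical, and it assumes the binary column really is WKB in EPSG:4326):

LOAD spatial;

-- No 'geo' metadata, so we assert the encoding ourselves and hope the CRS guess is right
SELECT * EXCLUDE (geom), ST_GeomFromWKB(geom) AS geom
FROM read_parquet('plain_wkb_no_metadata.parquet');  -- hypothetical file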

What happens if metadata is wrong?

Worse than missing metadata.

Common failure modes:

  • CRS declared as EPSG:4326 but the coordinates are actually meters (e.g., EPSG:3857 or a national grid)
  • geometry encoding declared incorrectly (rare, but possible)
  • axis order confusion in some older ecosystems

If you publish GeoParquet, treat metadata correctness as a “contract”.


Practical interoperability checklist

If you want your GeoParquet to behave predictably across tools:

1) Use WKB geometry encoding

It’s the most widely supported.

2) Use a single geometry column named consistently

geom or geometry are common conventions. Consistency reduces friction.

3) Ensure CRS is correct and explicit

If your analytics use lat/lon and web maps, EPSG:4326 (or its lon/lat equivalent OGC:CRS84, which the GeoParquet spec uses as the default when no CRS is declared) is often the safest choice for interchange.

4) Avoid exotic geometry types unless needed

Stick to POINT, LINESTRING, POLYGON, and their MULTI variants.

If your pipeline can emit GEOMETRYCOLLECTION, some consumers may struggle.

5) Make validity a pipeline step

Invalid geometries can break spatial predicates in downstream tools.
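
A minimal sketch of that step in DuckDB, using the features table from the earlier example (ST_IsValid ships with the spatial extension; whether a repair function like ST_MakeValid is available depends on your build, so treat that line as an assumption):

-- Count invalid geometries before publishing
SELECT COUNT(*) AS n_invalid
FROM features
WHERE NOT ST_IsValid(geom);

-- If your spatial build provides ST_MakeValid, repair in place (assumption)
-- UPDATE features SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);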


Reading GeoParquet in DuckDB (common workflow)

In DuckDB, you typically:

  • read the parquet
  • interpret geometry via spatial functions
  • run spatial SQL

A pattern you’ll see often:

INSTALL spatial;
LOAD spatial;

-- Read parquet (GeoParquet metadata helps tools understand geometry semantics)
CREATE OR REPLACE TABLE g AS
SELECT * FROM read_parquet('data/output/admin_adm1.parquet');

Then validate (if your DuckDB version exposes the geometry column as a raw BLOB rather than GEOMETRY, wrap it in ST_GeomFromWKB first):

SELECT
  ST_GeometryType(geom) AS geom_type,
  COUNT(*) AS n
FROM g
GROUP BY 1
ORDER BY n DESC;

And compute extents:

SELECT
  MIN(ST_XMin(geom)) AS xmin,
  MIN(ST_YMin(geom)) AS ymin,
  MAX(ST_XMax(geom)) AS xmax,
  MAX(ST_YMax(geom)) AS ymax
FROM g;

In many workflows, you store those numbers as dataset metadata for catalogs or quick map framing.


Writing GeoParquet: why it’s sometimes confusing

The ecosystem is still evolving.

Many tools can write “Parquet with geometry as WKB”, but not all of them write GeoParquet metadata consistently.

So you’ll see these scenarios:

  • Tool A writes perfect GeoParquet metadata
  • Tool B writes WKB but no metadata
  • Tool C writes metadata but uses a CRS representation that other tools misread

That’s why a workflow that includes:

  • a validation step (read your own output),
  • metadata inspection,
  • and interoperability testing with at least two different readers

is essential if you publish GeoParquet as a data product.


When you publish GeoParquet, test at least:

Test 1: Schema + geometry readability
  • geometry column is present
  • geometry can be parsed
  • geometry type distribution is sensible

Test 2: CRS sanity check
  • bbox values look realistic for the intended CRS

Examples:

  • EPSG:4326 should have longitude within [-180, 180] and latitude within [-90, 90]
  • EPSG:3857 should have coordinates in meters with magnitude up to ~20 million

Test 3: One spatial operation
  • ST_Intersects, ST_Within, or ST_Area runs successfully
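
For Test 3, any cheap spatial call over the whole column will do; a sketch against the table g loaded earlier:

-- If this runs end to end, parsing and at least one spatial operation both work
SELECT COUNT(*) AS n, SUM(ST_Area(geom)) AS total_area
FROM g;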

Test 4: Cross-tool read

Read the same file in a second tool (e.g., GDAL/OGR or GeoPandas) to confirm CRS and geometry interpretation match.


Cloud-native GeoParquet: why this format is a big deal

GeoParquet unlocks a data distribution model that looks like this:

  • publish files to object storage (S3/R2/Azure Blob)
  • query in DuckDB (local), Spark, Polars, Arrow engines
  • query in the browser (DuckDB WASM)
  • derive tiles or summaries as build artifacts

Instead of treating a GIS database as “the product”, you treat a GeoParquet dataset as the product.
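
As a sketch of what "the dataset is the product" looks like in practice (the URL is a hypothetical placeholder; httpfs handles the remote read and spatial the geometry):

INSTALL httpfs;
INSTALL spatial;
LOAD httpfs;
LOAD spatial;

-- Query a published release directly from object storage, no database server involved
SELECT COUNT(*) AS n
FROM read_parquet('https://data.example.com/releases/2025-01/admin_adm1.parquet');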

This matters because:

  • it’s versionable (immutable file releases)
  • it’s cacheable (CDN friendly)
  • it’s portable (works across stacks)
  • it encourages reproducible pipelines



Common mistakes (and how to avoid them)

Mistake 1: “It’s Parquet, so it’s GeoParquet”

Not necessarily. GeoParquet requires correct metadata.

Mistake 2: Wrong CRS declared

Always do a bbox sanity check after exporting.

Mistake 3: Forgetting geometry validity

Run an ST_IsValid check and repair pipeline where needed.

Mistake 4: One giant file

For cloud-native querying, partition by something meaningful (date, country, admin level) to avoid scanning too much.


A practical publishing pattern

If you’re building analytics-first datasets:

  1. Produce canonical GeoParquet at EPSG:4326 (WKB geometry)
  2. Add bbox columns (xmin,ymin,xmax,ymax) for fast filtering
  3. Partition outputs by a high-selectivity key (e.g., country_code, year, month)
  4. Publish to object storage
  5. Validate output by reading it in at least two tools
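
A DuckDB-flavoured sketch of steps 1–4 (table names, the source CRS, and the country_code partition key are all illustrative; whether the writer emits GeoParquet 'geo' metadata depends on the tool and version, which is exactly why step 5 matters):

-- 1) Canonical data in EPSG:4326 (source CRS 'EPSG:3857' is an assumption for this sketch)
CREATE OR REPLACE TABLE canonical AS
SELECT * EXCLUDE (geom),
       ST_Transform(geom, 'EPSG:3857', 'EPSG:4326') AS geom
FROM features;

-- 2) Add bbox columns for fast filtering
CREATE OR REPLACE TABLE publish AS
SELECT *,
       ST_XMin(geom) AS xmin, ST_YMin(geom) AS ymin,
       ST_XMax(geom) AS xmax, ST_YMax(geom) AS ymax
FROM canonical;

-- 3) + 4) Write partitioned Parquet ready to upload to object storage
COPY publish TO 'data/output/admin' (FORMAT PARQUET, PARTITION_BY (country_code));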