Storing H3 Efficiently: BIGINT vs String, Dictionary Encoding, and Parquet Compression Tips

If you store global time-series data on H3 (climate indices, mobility, land cover, risk scores), you quickly discover that how you store the H3 index can make or break your system.

H3 IDs show up in almost every query:

  • joins (cell → metrics, cell → admin)
  • grouping (rollups to parents)
  • filtering (viewport/time partitions)
  • exporting (Parquet/GeoParquet)

This post is a practical deep dive into:

  • H3 index as string vs BIGINT
  • why dictionary encoding matters
  • Parquet compression tips that actually help
  • schema patterns that are friendly for DuckDB, Spark/Arrow/Polars, and browser use


What exactly is an “H3 index” in storage terms?

An H3 cell identifier is typically represented in two common ways:

  • string form (hex string, e.g. 8928308280fffff)
  • integer form (often called “H3 bigint” or “H3 int64” representation)

Both represent the same 64-bit cell ID; the hex string is simply that integer rendered in hexadecimal. The storage and performance characteristics, however, differ.
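
The conversion between the two forms is cheap and lossless. Here is a minimal Python sketch, assuming the h3-py v4 package is installed (the plain-int lines at the end work without any library):

import h3

cell_str = "8928308280fffff"            # hex string form (a res 9 cell)

# h3-py v4 converts directly between the two representations
cell_int = h3.str_to_int(cell_str)      # 64-bit integer form
assert h3.int_to_str(cell_int) == cell_str

# The string is just the integer rendered as hex, so plain Python works too
assert int(cell_str, 16) == cell_int
assert format(cell_int, "x") == cell_str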

Your choice impacts:

  • file size and compression ratio
  • join and group-by performance
  • interoperability across languages/tools
  • how easily you can compute parent/children relationships

BIGINT vs String: the tradeoffs

String (hex) H3 index

Pros:

  • human-readable and easy to debug
  • very common in geospatial tooling and APIs
  • stable interchange format (copy/paste friendly)
  • avoids integer signedness edge cases across languages

Cons:

  • typically larger on disk than an integer
  • can be slower to join/group at very large scale (string comparisons)
  • more memory overhead in some runtimes

BIGINT (int64) H3 index

Pros:

  • compact fixed-width representation
  • fast joins and group-bys in many engines
  • often compresses extremely well in Parquet
  • good for high-volume time-series fact tables

Cons:

  • not human-readable (harder debugging)
  • signed/unsigned nuance across languages (some ecosystems treat it differently)
  • some external tools expect string form, so you may need conversion

Practical recommendation: store both (but in the right places)

A common, production-friendly pattern is:

  • store BIGINT in the fact tables (metrics, time series)
  • store string only where needed for interoperability or UI
  • maintain a dimension table mapping h3_bigint ↔ h3_string

This gives you the best of both worlds:

  • fast joins/groupbys for analytics
  • easy export/debug when needed
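
A minimal sketch of building such a mapping in Python, assuming h3-py and pyarrow are installed (the cell list and the output file name are hypothetical):

import h3
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical input: the distinct cells that appear in your fact data
cells_str = sorted(h3.grid_disk("8928308280fffff", 1))   # a small neighborhood as a stand-in

dim = pa.table({
    "h3_bigint": pa.array([h3.str_to_int(c) for c in cells_str], type=pa.int64()),
    "h3_index":  pa.array(cells_str, type=pa.string()),
    "res":       pa.array([h3.get_resolution(c) for c in cells_str], type=pa.int32()),
})

# One small dimension file; the big fact tables carry only h3_bigint
pq.write_table(dim, "cells_dim.parquet", compression="zstd")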

Schema patterns that scale

Pattern A: fact table (metrics) with BIGINT
h3_bigint   BIGINT   -- primary join key
date        DATE
spi_3       DOUBLE
spei_3      DOUBLE
...

This is your high-volume table. Make it as compact as possible.

Pattern B: dimension table with both representations
h3_bigint   BIGINT
h3_index    VARCHAR   -- the hex string
res         INTEGER   -- optional: store resolution
...

Optionally add geometry (but keep it separate if you can):

  • cells_geom table keyed by h3_bigint or h3_index

Pattern C: lookup tables for admin joins
h3_bigint   BIGINT
admin_id    INTEGER
weight      DOUBLE   -- optional area fraction

This allows fast region summaries without geometry at runtime.
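
As a sketch, assuming Parquet files shaped like Patterns A and C above (file names and the populated weight column are hypothetical), a region rollup in DuckDB from Python could look like:

import duckdb

# Area-weighted regional mean per admin unit and date; no geometry touched at query time
regional = duckdb.sql("""
    SELECT
        a.admin_id,
        m.date,
        SUM(m.spi_3 * a.weight) / SUM(a.weight) AS spi_3_weighted
    FROM read_parquet('metrics.parquet')  AS m
    JOIN read_parquet('h3_admin.parquet') AS a USING (h3_bigint)
    GROUP BY a.admin_id, m.date
""").df()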


Dictionary encoding: why repeated H3 values compress well

Parquet supports multiple encodings, and a very important one for categorical-like columns is dictionary encoding.

If a column repeats the same values many times, dictionary encoding stores:

  • a dictionary of unique values
  • integer codes referencing the dictionary for each row

Where H3 shows up:

  • time series tables repeat the same H3 IDs across many dates
  • lookup tables may repeat admin IDs across many H3 cells
  • “res” column repeats constantly

That means dictionary encoding can drastically reduce size and speed up scans.

Best case: time-series with repeated H3 cells

If you have:

  • 1 million H3 cells
  • 500 monthly timestamps

You have 500 million rows, but only 1 million unique H3 IDs. That’s perfect for dictionary encoding.

The net effect:

  • H3 column compresses like a categorical column
  • scanning becomes cheaper because fewer bytes need to be read
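
A rough way to see the effect yourself with pyarrow (synthetic integers stand in for real H3 IDs, and exact sizes depend on ordering and codec):

import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Synthetic stand-ins for H3 IDs: 10,000 distinct int64 values, each appearing 500 times,
# in shuffled order (5,000,000 rows). Only the H3 column, to isolate its footprint.
rng = np.random.default_rng(0)
cells = rng.integers(600_000_000_000_000_000, 640_000_000_000_000_000, size=10_000)
h3_col = rng.permutation(np.repeat(cells, 500))

table = pa.table({"h3_bigint": h3_col})

# use_dictionary=True is pyarrow's default; it is spelled out here to make the comparison explicit
pq.write_table(table, "h3_dict_on.parquet", use_dictionary=True, compression="zstd")
pq.write_table(table, "h3_dict_off.parquet", use_dictionary=False, compression="zstd")

print(os.path.getsize("h3_dict_on.parquet"), os.path.getsize("h3_dict_off.parquet"))

The dictionary-encoded file is usually noticeably smaller, because the column is stored as small integer codes plus one dictionary of the 10,000 unique values.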

When strings can still compress surprisingly well

Even though strings are larger, Parquet can still compress them effectively when:

  • values share prefixes and suffixes (the leading characters encode resolution and base cell; unused trailing digits render as f padding)
  • dictionary encoding is applied
  • compression codec is strong

But in practice, BIGINT is still usually more compact and faster for heavy analytics.


Compression codecs: what matters in practice

Parquet compression codecs commonly used:

  • Snappy: fast, widely supported, good default
  • ZSTD: better compression ratio, still fast enough in many cases
  • GZIP: strong compression but slower; less common for analytics workflows

Practical guidance:

  • use Snappy when you prioritize speed and wide interoperability
  • use ZSTD when you want smaller files (especially for object storage + browser delivery)

If your workflow includes DuckDB WASM or CDN delivery, smaller files can be worth it.
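
A quick pyarrow sketch for comparing codecs on an existing file (file names are hypothetical; compression_level trades write time for file size):

import pyarrow.parquet as pq

# Re-encode an existing metrics file with two different codecs
table = pq.read_table("metrics.parquet")

pq.write_table(table, "metrics_snappy.parquet", compression="snappy")
pq.write_table(table, "metrics_zstd.parquet", compression="zstd",
               compression_level=9)     # higher level: smaller file, slower write

In DuckDB, the equivalent knob is the COMPRESSION option on COPY ... (FORMAT PARQUET, COMPRESSION ZSTD).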


Parquet layout tips for H3 datasets

Tip 1: partition by time (and maybe by resolution)

Time is the most common filter.

metrics/
  res=6/
    year=2025/
      month=07/
        part-0000.parquet

This allows partition pruning so you don’t scan unnecessary months.
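
With DuckDB, a pruned read over that layout can look like this sketch (the types of the partition columns depend on DuckDB's hive-partition type detection):

import duckdb

# Partition columns (res, year, month) come from the directory names; the WHERE clause
# lets DuckDB skip non-matching directories instead of scanning them
df = duckdb.sql("""
    SELECT h3_bigint, date, spi_3
    FROM read_parquet('metrics/res=6/*/*/*.parquet', hive_partitioning = true)
    WHERE year = 2025 AND month = 7
""").df()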

Tip 2: keep geometry out of the big fact table

Geometry columns are heavy. Don’t put them in the same Parquet as time-series metrics unless you truly need them.

Use:

  • metrics.parquet (no geometry)
  • cells_geom.parquet (geometry once)
  • join when needed
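
A sketch of the on-demand join, with hypothetical file names and a hypothetical geom_wkb column:

import duckdb

# Only the rows that survive the metric filter ever get geometry attached
df = duckdb.sql("""
    SELECT m.h3_bigint, m.date, m.spi_3, g.geom_wkb
    FROM read_parquet('metrics.parquet')    AS m
    JOIN read_parquet('cells_geom.parquet') AS g USING (h3_bigint)
    WHERE m.spi_3 < -1.5                    -- e.g. only cells currently in drought
""").df()
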
Tip 3: keep columns narrow and typed

Avoid “stringly typed” schemas.

  • store date as DATE
  • store indicators as DOUBLE/FLOAT
  • store ids as BIGINT/INTEGER

Good typing improves compression and query performance.
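
A small pyarrow sketch of an explicit schema (column names follow Pattern A; adapt to your own tables):

import pyarrow as pa

# Explicit, narrow types instead of letting everything default to strings
metrics_schema = pa.schema([
    ("h3_bigint", pa.int64()),
    ("date",      pa.date32()),
    ("spi_3",     pa.float64()),
    ("spei_3",    pa.float64()),
])

# Passing schema= when building the table (e.g. pa.Table.from_pandas(df, schema=metrics_schema))
# gives the Parquet file real BIGINT/DATE/DOUBLE columns instead of VARCHAR everywhere.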

Tip 4: consider sorting or clustering by H3

Many engines benefit when related rows are physically close.

If you write Parquet so that rows are grouped by:

  • h3_bigint, then date

…you can improve:

  • dictionary encoding efficiency
  • row group pruning
  • join locality in some workloads

Even if you don’t explicitly “sort”, writing data in a consistent order often helps.
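
One way to sketch this with DuckDB is to wrap an ORDER BY inside the COPY (the database and table names here are hypothetical):

import duckdb

con = duckdb.connect("build.duckdb")    # hypothetical build database containing metrics_norm
con.sql("""
    COPY (
        SELECT *
        FROM metrics_norm
        ORDER BY h3_bigint, date        -- cluster rows by cell, then by time
    )
    TO 'metrics_sorted.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD)
""")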

Tip 5: compute and store resolution as a separate column

If you store multi-resolution datasets, it helps to carry res explicitly:

  • easier sanity checks
  • easier validation
  • easier debugging across partitions
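
A small Python sketch, assuming h3-py. The bit-shift variant relies on the H3 index layout (4 resolution bits at offset 52) and is handy when only the BIGINT column is at hand:

import h3

cell_str = "8928308280fffff"
cell_int = h3.str_to_int(cell_str)

# Library call on the string form
assert h3.get_resolution(cell_str) == 9

# The resolution also sits in 4 bits at offset 52 of the 64-bit index,
# which works just as well as a bit shift on the BIGINT column in SQL
assert (cell_int >> 52) & 0xF == 9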

Interoperability: where BIGINT can bite you

Not all ecosystems handle 64-bit integers equally.

Potential issues:

  • JavaScript Number cannot exactly represent all 64-bit integers
  • some JSON serializers lose precision if you export BIGINT directly
  • some tools treat int64 as signed while others use unsigned semantics

Practical workarounds:

  • in APIs: send H3 as string
  • in analytics: store as BIGINT
  • in the browser: keep h3_index (string) as the exchange format

That’s why the “store both” pattern is so useful.
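
A small Python sketch of the problem and the workaround, using the example cell from earlier in this post:

import json

cell_int = int("8928308280fffff", 16)   # the res 9 example cell as a 64-bit integer

# JavaScript's Number is only exact up to 2**53 - 1; this ID is far above that
print(cell_int, cell_int > 2**53)       # 617700169958293503 True

# Safe pattern for APIs: send the hex string (or a stringified integer) instead
payload = json.dumps({"h3_index": format(cell_int, "x"), "spi_3": -1.8})
print(payload)                          # {"h3_index": "8928308280fffff", "spi_3": -1.8}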


Joining strategy: why BIGINT often wins

On very large datasets, joins on BIGINT typically outperform joins on VARCHAR.

A recommended approach:

  • convert incoming string IDs once
  • then join on BIGINT internally

If your ingestion starts with strings, build the dimension mapping and normalize early.
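
A DuckDB sketch of normalizing once at ingest (the input files are hypothetical; cells_dim is the bigint ↔ string dimension from earlier):

import duckdb

con = duckdb.connect()

# The string join happens exactly once, here
con.sql("""
    CREATE OR REPLACE TABLE incoming_norm AS
    SELECT d.h3_bigint, r.date, r.spi_3
    FROM read_parquet('incoming_raw.parquet') AS r
    JOIN read_parquet('cells_dim.parquet')    AS d USING (h3_index)
""")

# Every later join and group-by uses the compact integer key
con.sql("SELECT h3_bigint, avg(spi_3) AS spi_3_mean FROM incoming_norm GROUP BY h3_bigint")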


DuckDB-friendly workflow: normalize and publish

A typical build step:

  1. ingest raw data
  2. generate or attach h3_bigint
  3. write metrics Parquet (BIGINT key)
  4. write dimension Parquet (bigint ↔ string)
  5. write geometry Parquet (optional)

Example conceptual normalization step:

CREATE OR REPLACE TABLE metrics_norm AS
SELECT
  h3_bigint,        -- assumes h3_bigint was generated/attached in step 2
  date,
  spi_3,
  spei_3
FROM metrics_raw;

Then publish:

COPY (
  SELECT *, year(date) AS year, month(date) AS month
  FROM metrics_norm
)
TO 'output/metrics/res=6'
(FORMAT PARQUET, PARTITION_BY (year, month), COMPRESSION ZSTD);

(Your exact partition columns depend on how you model time.)


A practical “do this” checklist

If you’re building an H3-based analytics product:

  • Use BIGINT for internal joins and aggregations
  • Keep a mapping table to string IDs for exports/APIs/UI
  • Keep geometry separate from time-series metrics
  • Partition Parquet primarily by time (and by resolution if multi-res)
  • Prefer Snappy for speed, ZSTD for smaller files (especially for object storage/browser)
  • Validate that your chosen representation works end-to-end across your toolchain (Python, JS, DuckDB WASM)