Storing H3 Efficiently: BIGINT vs String, Dictionary Encoding, and Parquet Compression Tips

If you store global time-series data on H3 (climate indices, mobility, land cover, risk scores), you quickly discover that how you store the H3 index can make or break your system.

H3 IDs show up in almost every query:

  • joins (cell → metrics, cell → admin)
  • grouping (rollups to parents)
  • filtering (viewport/time partitions)
  • exporting (Parquet/GeoParquet)

This post is a practical deep dive into:

  • H3 index as string vs BIGINT
  • why dictionary encoding matters
  • Parquet compression tips that actually help
  • schema patterns that are friendly for DuckDB, Spark/Arrow/Polars, and browser use


What exactly is an “H3 index” in storage terms?

An H3 cell identifier is typically represented in two common ways:

  • string form (hex string, e.g. 8928308280fffff)
  • integer form (often called “H3 bigint” or “H3 int64” representation)

Both represent the same 64-bit cell ID; the hex string is simply that integer rendered in hexadecimal. The storage and performance characteristics, however, differ.
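
The conversion between the two forms is cheap and lossless. Here is a minimal Python sketch, assuming the h3-py v4 package is installed (the plain-int lines at the end work without any library):

import h3

cell_str = "8928308280fffff"            # hex string form (a res 9 cell)

# h3-py v4 converts directly between the two representations
cell_int = h3.str_to_int(cell_str)      # 64-bit integer form
assert h3.int_to_str(cell_int) == cell_str

# The string is just the integer rendered as hex, so plain Python works too
assert int(cell_str, 16) == cell_int
assert format(cell_int, "x") == cell_str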

Your choice impacts:

  • file size and compression ratio
  • join and group-by performance
  • interoperability across languages/tools
  • how easily you can compute parent/children relationships

BIGINT vs String: the tradeoffs

String (hex) H3 index

Pros:

  • human-readable and easy to debug
  • very common in geospatial tooling and APIs
  • stable interchange format (copy/paste friendly)
  • avoids integer signedness edge cases across languages

Cons:

  • typically larger on disk than an integer
  • can be slower to join/group at very large scale (string comparisons)
  • more memory overhead in some runtimes

BIGINT (int64) H3 index

Pros:

  • compact fixed-width representation
  • fast joins and group-bys in many engines
  • often compresses extremely well in Parquet
  • good for high-volume time-series fact tables

Cons:

  • not human-readable (harder debugging)
  • signed/unsigned nuance across languages (some ecosystems treat it differently)
  • some external tools expect string form, so you may need conversion

Practical recommendation: store both (but in the right places)

A common, production-friendly pattern is:

  • store BIGINT in the fact tables (metrics, time series)
  • store string only where needed for interoperability or UI
  • maintain a dimension table mapping h3_bigint ↔ h3_string

This gives you the best of both worlds:

  • fast joins/groupbys for analytics
  • easy export/debug when needed
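
A minimal sketch of building such a mapping in Python, assuming h3-py and pyarrow are installed (the cell list and the output file name are hypothetical):

import h3
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical input: the distinct cells that appear in your fact data
cells_str = sorted(h3.grid_disk("8928308280fffff", 1))   # a small neighborhood as a stand-in

dim = pa.table({
    "h3_bigint": pa.array([h3.str_to_int(c) for c in cells_str], type=pa.int64()),
    "h3_index":  pa.array(cells_str, type=pa.string()),
    "res":       pa.array([h3.get_resolution(c) for c in cells_str], type=pa.int32()),
})

# One small dimension file; the big fact tables carry only h3_bigint
pq.write_table(dim, "cells_dim.parquet", compression="zstd")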

Schema patterns that scale

Pattern A: fact table (metrics) with BIGINT
h3_bigint   BIGINT   -- primary join key
date        DATE
spi_3       DOUBLE
spei_3      DOUBLE
...

This is your high-volume table. Make it as compact as possible.

Pattern B: dimension table with both representations
h3_bigint   BIGINT
h3_index    VARCHAR   -- the hex string
res         INTEGER   -- optional: store resolution
...

Optionally add geometry (but keep it separate if you can):

  • cells_geom table keyed by h3_bigint or h3_index

Pattern C: lookup tables for admin joins
h3_bigint   BIGINT
admin_id    INTEGER
weight      DOUBLE   -- optional area fraction

This allows fast region summaries without geometry at runtime.
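
As a sketch, assuming Parquet files shaped like Patterns A and C above (file names and the populated weight column are hypothetical), a region rollup in DuckDB from Python could look like:

import duckdb

# Area-weighted regional mean per admin unit and date; no geometry touched at query time
regional = duckdb.sql("""
    SELECT
        a.admin_id,
        m.date,
        SUM(m.spi_3 * a.weight) / SUM(a.weight) AS spi_3_weighted
    FROM read_parquet('metrics.parquet')  AS m
    JOIN read_parquet('h3_admin.parquet') AS a USING (h3_bigint)
    GROUP BY a.admin_id, m.date
""").df()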


Dictionary encoding: why repeated H3 values compress well

Parquet supports multiple encodings, and a very important one for categorical-like columns is dictionary encoding.

If a column repeats the same values many times, dictionary encoding stores:

  • a dictionary of unique values
  • integer codes referencing the dictionary for each row

Where H3 shows up:

  • time series tables repeat the same H3 IDs across many dates
  • lookup tables may repeat admin IDs across many H3 cells
  • “res” column repeats constantly

That means dictionary encoding can drastically reduce size and speed up scans.

Best case: time-series with repeated H3 cells

If you have:

  • 1 million H3 cells
  • 500 monthly timestamps

You have 500 million rows, but only 1 million unique H3 IDs. That’s perfect for dictionary encoding.

The net effect:

  • H3 column compresses like a categorical column
  • scanning becomes cheaper because fewer bytes need to be read
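
A rough way to see the effect yourself with pyarrow (synthetic integers stand in for real H3 IDs, and exact sizes depend on ordering and codec):

import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Synthetic stand-ins for H3 IDs: 10,000 distinct int64 values, each appearing 500 times,
# in shuffled order (5,000,000 rows). Only the H3 column, to isolate its footprint.
rng = np.random.default_rng(0)
cells = rng.integers(600_000_000_000_000_000, 640_000_000_000_000_000, size=10_000)
h3_col = rng.permutation(np.repeat(cells, 500))

table = pa.table({"h3_bigint": h3_col})

# use_dictionary=True is pyarrow's default; it is spelled out here to make the comparison explicit
pq.write_table(table, "h3_dict_on.parquet", use_dictionary=True, compression="zstd")
pq.write_table(table, "h3_dict_off.parquet", use_dictionary=False, compression="zstd")

print(os.path.getsize("h3_dict_on.parquet"), os.path.getsize("h3_dict_off.parquet"))

The dictionary-encoded file is usually noticeably smaller, because the column is stored as small integer codes plus one dictionary of the 10,000 unique values.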

When strings can still compress surprisingly well

Even though strings are larger, Parquet can still compress them effectively when:

  • values share prefixes and suffixes (the leading characters encode resolution and base cell; unused trailing digits render as f padding)
  • dictionary encoding is applied
  • compression codec is strong

But in practice, BIGINT is still usually more compact and faster for heavy analytics.


Compression codecs: what matters in practice

Parquet compression codecs commonly used:

  • Snappy: fast, widely supported, good default
  • ZSTD: better compression ratio, still fast enough in many cases
  • GZIP: strong compression but slower; less common for analytics workflows

Practical guidance:

  • use Snappy when you prioritize speed and wide interoperability
  • use ZSTD when you want smaller files (especially for object storage + browser delivery)

If your workflow includes DuckDB WASM or CDN delivery, smaller files can be worth it.
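
A quick pyarrow sketch for comparing codecs on an existing file (file names are hypothetical; compression_level trades write time for file size):

import pyarrow.parquet as pq

# Re-encode an existing metrics file with two different codecs
table = pq.read_table("metrics.parquet")

pq.write_table(table, "metrics_snappy.parquet", compression="snappy")
pq.write_table(table, "metrics_zstd.parquet", compression="zstd",
               compression_level=9)     # higher level: smaller file, slower write

In DuckDB, the equivalent knob is the COMPRESSION option on COPY ... (FORMAT PARQUET, COMPRESSION ZSTD).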


Parquet layout tips for H3 datasets

Tip 1: partition by time (and maybe by resolution)

Time is the most common filter.

metrics/
  res=6/
    year=2025/
      month=07/
        part-0000.parquet

This allows partition pruning so you don’t scan unnecessary months.
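
With DuckDB, a pruned read over that layout can look like this sketch (the types of the partition columns depend on DuckDB's hive-partition type detection):

import duckdb

# Partition columns (res, year, month) come from the directory names; the WHERE clause
# lets DuckDB skip non-matching directories instead of scanning them
df = duckdb.sql("""
    SELECT h3_bigint, date, spi_3
    FROM read_parquet('metrics/res=6/*/*/*.parquet', hive_partitioning = true)
    WHERE year = 2025 AND month = 7
""").df()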

Tip 2: keep geometry out of the big fact table

Geometry columns are heavy. Don’t put them in the same Parquet as time-series metrics unless you truly need them.

Use:

  • metrics.parquet (no geometry)
  • cells_geom.parquet (geometry once)
  • join when needed
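
A sketch of the on-demand join, with hypothetical file names and a hypothetical geom_wkb column:

import duckdb

# Only the rows that survive the metric filter ever get geometry attached
df = duckdb.sql("""
    SELECT m.h3_bigint, m.date, m.spi_3, g.geom_wkb
    FROM read_parquet('metrics.parquet')    AS m
    JOIN read_parquet('cells_geom.parquet') AS g USING (h3_bigint)
    WHERE m.spi_3 < -1.5                    -- e.g. only cells currently in drought
""").df()
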
Tip 3: keep columns narrow and typed

Avoid “stringly typed” schemas.

  • store date as DATE
  • store indicators as DOUBLE/FLOAT
  • store ids as BIGINT/INTEGER

Good typing improves compression and query performance.
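
A small pyarrow sketch of an explicit schema (column names follow Pattern A; adapt to your own tables):

import pyarrow as pa

# Explicit, narrow types instead of letting everything default to strings
metrics_schema = pa.schema([
    ("h3_bigint", pa.int64()),
    ("date",      pa.date32()),
    ("spi_3",     pa.float64()),
    ("spei_3",    pa.float64()),
])

# Passing schema= when building the table (e.g. pa.Table.from_pandas(df, schema=metrics_schema))
# gives the Parquet file real BIGINT/DATE/DOUBLE columns instead of VARCHAR everywhere.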

Tip 4: consider sorting or clustering by H3

Many engines benefit when related rows are physically close.

If you write Parquet so that rows are grouped by:

  • h3_bigint, then date

…you can improve:

  • dictionary encoding efficiency
  • row group pruning
  • join locality in some workloads

Even if you don’t explicitly “sort”, writing data in a consistent order often helps.
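
One way to sketch this with DuckDB is to wrap an ORDER BY inside the COPY (the database and table names here are hypothetical):

import duckdb

con = duckdb.connect("build.duckdb")    # hypothetical build database containing metrics_norm
con.sql("""
    COPY (
        SELECT *
        FROM metrics_norm
        ORDER BY h3_bigint, date        -- cluster rows by cell, then by time
    )
    TO 'metrics_sorted.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD)
""")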

Tip 5: compute and store resolution as a separate column

If you store multi-resolution datasets, it helps to carry res explicitly:

  • easier sanity checks
  • easier validation
  • easier debugging across partitions
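
A small Python sketch, assuming h3-py. The bit-shift variant relies on the H3 index layout (4 resolution bits at offset 52) and is handy when only the BIGINT column is at hand:

import h3

cell_str = "8928308280fffff"
cell_int = h3.str_to_int(cell_str)

# Library call on the string form
assert h3.get_resolution(cell_str) == 9

# The resolution also sits in 4 bits at offset 52 of the 64-bit index,
# which works just as well as a bit shift on the BIGINT column in SQL
assert (cell_int >> 52) & 0xF == 9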

Interoperability: where BIGINT can bite you

Not all ecosystems handle 64-bit integers equally.

Potential issues:

  • JavaScript Number cannot exactly represent all 64-bit integers
  • some JSON serializers lose precision if you export BIGINT directly
  • some tools treat int64 as signed while others use unsigned semantics

Practical workarounds:

  • in APIs: send H3 as string
  • in analytics: store as BIGINT
  • in the browser: keep h3_index (string) as the exchange format

That’s why the “store both” pattern is so useful.
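
A small Python sketch of the problem and the workaround, using the example cell from earlier in this post:

import json

cell_int = int("8928308280fffff", 16)   # the res 9 example cell as a 64-bit integer

# JavaScript's Number is only exact up to 2**53 - 1; this ID is far above that
print(cell_int, cell_int > 2**53)       # 617700169958293503 True

# Safe pattern for APIs: send the hex string (or a stringified integer) instead
payload = json.dumps({"h3_index": format(cell_int, "x"), "spi_3": -1.8})
print(payload)                          # {"h3_index": "8928308280fffff", "spi_3": -1.8}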


Joining strategy: why BIGINT often wins

On very large datasets, joins on BIGINT typically outperform joins on VARCHAR.

A recommended approach:

  • convert incoming string IDs once
  • then join on BIGINT internally

If your ingestion starts with strings, build the dimension mapping and normalize early.
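
A DuckDB sketch of normalizing once at ingest (the input files are hypothetical; cells_dim is the bigint ↔ string dimension from earlier):

import duckdb

con = duckdb.connect()

# The string join happens exactly once, here
con.sql("""
    CREATE OR REPLACE TABLE incoming_norm AS
    SELECT d.h3_bigint, r.date, r.spi_3
    FROM read_parquet('incoming_raw.parquet') AS r
    JOIN read_parquet('cells_dim.parquet')    AS d USING (h3_index)
""")

# Every later join and group-by uses the compact integer key
con.sql("SELECT h3_bigint, avg(spi_3) AS spi_3_mean FROM incoming_norm GROUP BY h3_bigint")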


DuckDB-friendly workflow: normalize and publish

A typical build step:

  1. ingest raw data
  2. generate or attach h3_bigint
  3. write metrics Parquet (BIGINT key)
  4. write dimension Parquet (bigint ↔ string)
  5. write geometry Parquet (optional)

Example conceptual normalization step:

CREATE OR REPLACE TABLE metrics_norm AS
SELECT
  h3_bigint,        -- assumes h3_bigint was generated/attached in step 2
  date,
  spi_3,
  spei_3
FROM metrics_raw;

Then publish:

COPY (
  SELECT *, year(date) AS year, month(date) AS month
  FROM metrics_norm
)
TO 'output/metrics/res=6'
(FORMAT PARQUET, PARTITION_BY (year, month), COMPRESSION ZSTD);

(Your exact partition columns depend on how you model time.)


A practical “do this” checklist

If you’re building an H3-based analytics product:

  • Use BIGINT for internal joins and aggregations
  • Keep a mapping table to string IDs for exports/APIs/UI
  • Keep geometry separate from time-series metrics
  • Partition Parquet primarily by time (and by resolution if multi-res)
  • Prefer Snappy for speed, ZSTD for smaller files (especially for object storage/browser)
  • Validate that your chosen representation works end-to-end across your toolchain (Python, JS, DuckDB WASM)