Storing H3 Efficiently: BIGINT vs String, Dictionary Encoding, and Parquet Compression Tips
If you store global time-series data on H3 (climate indices, mobility, land cover, risk scores), you quickly discover that how you store the H3 index can make or break your system.
H3 IDs show up in almost every query:
- joins (cell → metrics, cell → admin)
- grouping (rollups to parents)
- filtering (viewport/time partitions)
- exporting (Parquet/GeoParquet)
This post is a practical deep dive into:
- H3 index as string vs BIGINT
- why dictionary encoding matters
- Parquet compression tips that actually help
- schema patterns that are friendly for DuckDB, Spark/Arrow/Polars, and browser use
What exactly is an “H3 index” in storage terms?
An H3 cell identifier is typically represented in two common ways:
- string form (a hex string, e.g. 8928308280fffff)
- integer form (often called the "H3 bigint" or "H3 int64" representation)
Both represent the same cell, but storage and performance characteristics differ.
Your choice impacts:
- file size and compression ratio
- join and group-by performance
- interoperability across languages/tools
- how easily you can compute parent/children relationships
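As a concrete illustration, here is a minimal DuckDB sketch of converting between the two forms, assuming the h3 community extension is installed; the function names h3_string_to_h3 and h3_h3_to_string come from that extension and may differ across versions:

    INSTALL h3 FROM community;
    LOAD h3;

    -- hex string -> 64-bit integer form (may be returned as UBIGINT depending on the extension version)
    SELECT h3_string_to_h3('8928308280fffff') AS h3_bigint;

    -- 64-bit integer form -> hex string
    SELECT h3_h3_to_string(h3_string_to_h3('8928308280fffff')) AS h3_index;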
BIGINT vs String: the tradeoffs
String (hex) H3 index
Pros:
- human-readable and easy to debug
- very common in geospatial tooling and APIs
- stable interchange format (copy/paste friendly)
- avoids integer signedness edge cases across languages
Cons:
- typically larger on disk than an integer
- can be slower to join/group at very large scale (string comparisons)
- more memory overhead in some runtimes
BIGINT (int64) H3 index
Pros:
- compact fixed-width representation
- fast joins and group-bys in many engines
- often compresses extremely well in Parquet
- good for high-volume time-series fact tables
Cons:
- not human-readable (harder debugging)
- signed/unsigned nuance across languages (some ecosystems treat it differently)
- some external tools expect string form, so you may need conversion
Practical recommendation: store both (but in the right places)
A common, production-friendly pattern is:
- store BIGINT in the fact tables (metrics, time series)
- store string only where needed for interoperability or UI
- maintain a dimension table mapping h3_bigint ↔ h3_string
This gives you the best of both worlds:
- fast joins/group-bys for analytics
- easy export/debug when needed
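A minimal DuckDB sketch of the pattern, assuming the h3 community extension for the conversion and an illustrative staging table raw_observations that carries the string form:

    -- Dimension table: one row per distinct cell, both representations
    CREATE OR REPLACE TABLE cells_dim AS
    SELECT DISTINCT
        h3_string_to_h3(h3_index) AS h3_bigint,   -- conversion function from the h3 extension
        h3_index                  AS h3_string
    FROM raw_observations;

    -- Fact table keeps only the compact BIGINT key
    CREATE OR REPLACE TABLE metrics AS
    SELECT
        d.h3_bigint,
        r.date,
        r.spi_3,
        r.spei_3
    FROM raw_observations r
    JOIN cells_dim d ON d.h3_string = r.h3_index;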
Schema patterns that scale
Pattern A: fact table (metrics) with BIGINT
    h3_bigint  BIGINT   -- primary join key
    date       DATE
    spi_3      DOUBLE
    spei_3     DOUBLE
    ...
This is your high-volume table. Make it as compact as possible.
Pattern B: dimension table with both representations
    h3_bigint  BIGINT
    h3_index   VARCHAR  -- the hex string
    res        INTEGER  -- optional: store resolution
    ...
Optionally add geometry, but keep it separate if you can: a cells_geom table keyed by h3_bigint or h3_index.
Pattern C: lookup tables for admin joins
    h3_bigint  BIGINT
    admin_id   INTEGER
    weight     DOUBLE   -- optional area fraction
This allows fast region summaries without geometry at runtime.
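With that lookup in place, a region summary is a plain join and group-by on the BIGINT key; a sketch with illustrative table and column names:

    -- Area-weighted mean of an indicator per admin unit, no geometry touched at query time
    SELECT
        l.admin_id,
        m.date,
        SUM(m.spi_3 * l.weight) / SUM(l.weight) AS spi_3_weighted
    FROM metrics m
    JOIN h3_admin_lookup l USING (h3_bigint)
    GROUP BY l.admin_id, m.date;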
Dictionary encoding: why repeated H3 values compress well
Parquet supports multiple encodings, and a very important one for categorical-like columns is dictionary encoding.
If a column repeats the same values many times, dictionary encoding stores:
- a dictionary of unique values
- integer codes referencing the dictionary for each row
Where H3 shows up:
- time series tables repeat the same H3 IDs across many dates
- lookup tables may repeat admin IDs across many H3 cells
- “res” column repeats constantly
That means dictionary encoding can drastically reduce size and speed up scans.
Best case: time-series with repeated H3 cells
If you have:
- 1 million H3 cells
- 500 monthly timestamps
You have 500 million rows, but only 1 million unique H3 IDs. That’s perfect for dictionary encoding.
The net effect:
- H3 column compresses like a categorical column
- scanning becomes cheaper because fewer bytes need to be read
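If you want to verify that your writer actually produced dictionary-encoded column chunks, you can inspect the file metadata, for example with DuckDB's parquet_metadata() table function (column names here match recent DuckDB releases; check against your version):

    -- Per column chunk: which encodings were used and how well it compressed
    SELECT
        path_in_schema,
        encodings,
        total_uncompressed_size,
        total_compressed_size
    FROM parquet_metadata('metrics/res=6/year=2025/month=07/part-0000.parquet');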
When strings can still compress surprisingly well
Even though strings are larger, Parquet can still compress them effectively when:
- values share prefixes (spatially nearby cells at the same resolution often do)
- dictionary encoding is applied
- compression codec is strong
But in practice, BIGINT is still usually more compact and faster for heavy analytics.
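A quick way to check this on your own data is to write the same rows keyed both ways and compare compressed sizes; a rough sketch, assuming an illustrative metrics_both table that carries both representations:

    -- Write the same rows keyed by string and by BIGINT
    COPY (SELECT h3_string, date, spi_3 FROM metrics_both)
        TO 'probe_string.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
    COPY (SELECT h3_bigint, date, spi_3 FROM metrics_both)
        TO 'probe_bigint.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

    -- Compare total compressed bytes per file
    SELECT SUM(total_compressed_size) AS string_bytes FROM parquet_metadata('probe_string.parquet');
    SELECT SUM(total_compressed_size) AS bigint_bytes FROM parquet_metadata('probe_bigint.parquet');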
Compression codecs: what matters in practice
Parquet compression codecs commonly used:
- Snappy: fast, widely supported, good default
- ZSTD: better compression ratio, still fast enough in many cases
- GZIP: strong compression but slower; less common for analytics workflows
Practical guidance:
- use Snappy when you prioritize speed and wide interoperability
- use ZSTD when you want smaller files (especially for object storage + browser delivery)
If your workflow includes DuckDB WASM or CDN delivery, smaller files can be worth it.
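In DuckDB the codec is just a COPY option, so switching between the two per output target is cheap (metrics is an illustrative table name):

    -- Fast default for local/interactive analytics
    COPY metrics TO 'metrics_snappy.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY);

    -- Smaller files for object storage / CDN / DuckDB-WASM delivery
    COPY metrics TO 'metrics_zstd.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);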
Parquet layout tips for H3 datasets
Tip 1: partition by time (and maybe by resolution)
Time is the most common filter.
    metrics/
      res=6/
        year=2025/
          month=07/
            part-0000.parquet
This allows partition pruning so you don’t scan unnecessary months.
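Readers that understand Hive-style paths can then prune whole directories. A DuckDB sketch; depending on your version, partition values may come back as strings and need casting:

    -- Only files under year=2025/month=07 are actually read
    SELECT h3_bigint, date, spi_3
    FROM read_parquet('metrics/*/*/*/*.parquet', hive_partitioning = true)
    WHERE res = 6 AND year = 2025 AND month = 7;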
Tip 2: keep geometry out of the big fact table
Geometry columns are heavy. Don’t put them in the same Parquet as time-series metrics unless you truly need them.
Use:
- metrics.parquet (no geometry)
- cells_geom.parquet (geometry once)
- join when needed
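A sketch of the join-when-needed step, using the file split above (column and file names are illustrative):

    -- Attach geometry only for the rows you are about to render or export
    SELECT m.h3_bigint, m.date, m.spi_3, g.geom
    FROM read_parquet('metrics.parquet')    AS m
    JOIN read_parquet('cells_geom.parquet') AS g USING (h3_bigint)
    WHERE m.date = DATE '2025-07-01';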
Tip 3: keep columns narrow and typed
Avoid “stringly typed” schemas.
- store date as DATE
- store indicators as DOUBLE/FLOAT
- store ids as BIGINT/INTEGER
Good typing improves compression and query performance.
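If the raw input arrives stringly typed (for example from CSV), cast once at ingestion so every downstream file inherits proper types; a sketch with an illustrative raw_metrics.csv:

    -- One typed pass at ingestion; everything downstream inherits proper types
    CREATE OR REPLACE TABLE metrics_typed AS
    SELECT
        CAST(h3_bigint AS BIGINT) AS h3_bigint,   -- assumes a decimal integer string in the raw file
        CAST(date      AS DATE)   AS date,        -- ISO-8601 date strings cast cleanly
        CAST(spi_3     AS DOUBLE) AS spi_3
    FROM read_csv_auto('raw_metrics.csv', all_varchar = true);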
Tip 4: consider sorting or clustering by H3
Many engines benefit when related rows are physically close.
If you write Parquet so that rows are grouped by h3_bigint, then by date, you can improve:
- dictionary encoding efficiency
- row group pruning
- join locality in some workloads
Even if you don’t explicitly “sort”, writing data in a consistent order often helps.
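A minimal sketch of a clustered write in DuckDB, where the ORDER BY inside the COPY subquery controls the physical row order of the output:

    -- Rows for the same cell end up adjacent, which helps dictionary/RLE encoding and row group pruning
    COPY (
        SELECT * FROM metrics
        ORDER BY h3_bigint, date
    ) TO 'metrics_sorted.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);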
Tip 5: compute and store resolution as a separate column
If you store multi-resolution datasets, it helps to carry res explicitly:
- easier sanity checks
- easier validation
- easier debugging across partitions
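A sketch of deriving and checking the column, assuming the h3 extension's h3_get_resolution function (the name may differ by version):

    -- Derive resolution once and keep it alongside the key
    CREATE OR REPLACE TABLE metrics_with_res AS
    SELECT m.*, h3_get_resolution(m.h3_bigint) AS res
    FROM metrics m;

    -- Sanity check: each partition should contain exactly one resolution
    SELECT res, COUNT(*) AS n_rows, COUNT(DISTINCT h3_bigint) AS n_cells
    FROM metrics_with_res
    GROUP BY res;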
Interoperability: where BIGINT can bite you
Not all ecosystems handle 64-bit integers equally.
Potential issues:
- JavaScript Number cannot exactly represent all 64-bit integers
- some JSON serializers lose precision if you export BIGINT directly
- some tools treat int64 as signed while others use unsigned semantics
Practical workarounds:
- in APIs: send H3 as string
- in analytics: store as BIGINT
- in the browser: keep h3_index (string) as the exchange format
That’s why the “store both” pattern is so useful.
Joining strategy: why BIGINT often wins
On very large datasets, joins on BIGINT typically outperform joins on VARCHAR.
A recommended approach:
- convert incoming string IDs once
- then join on BIGINT internally
If your ingestion starts with strings, build the dimension mapping and normalize early.
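A sketch of normalizing at the ingestion boundary, with an illustrative incoming_events staging table and the conversion function from the h3 extension:

    -- Convert the string ID exactly once, at ingestion time
    CREATE OR REPLACE TABLE events_norm AS
    SELECT
        h3_string_to_h3(e.h3_index) AS h3_bigint,
        e.* EXCLUDE (h3_index)
    FROM incoming_events e;

    -- All later joins stay on the integer key
    SELECT n.*, d.h3_string
    FROM events_norm n
    JOIN cells_dim d USING (h3_bigint);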
DuckDB-friendly workflow: normalize and publish
A typical build step:
- ingest raw data
- generate or attach h3_bigint
- write metrics Parquet (BIGINT key)
- write dimension Parquet (bigint ↔ string)
- write geometry Parquet (optional)
Example conceptual normalization step:
    CREATE OR REPLACE TABLE metrics_norm AS
    SELECT
        h3_bigint,
        date,
        date_part('year', date)  AS year,   -- partition columns used by the COPY below
        date_part('month', date) AS month,
        spi_3,
        spei_3
    FROM metrics_raw;
Then publish:
    COPY metrics_norm
    TO 'output/metrics/res=6'
    (FORMAT PARQUET, PARTITION_BY (year, month));
(Your exact partition columns depend on how you model time.)
A practical “do this” checklist
If you’re building an H3-based analytics product:
- Use BIGINT for internal joins and aggregations
- Keep a mapping table to string IDs for exports/APIs/UI
- Keep geometry separate from time-series metrics
- Partition Parquet primarily by time (and by resolution if multi-res)
- Prefer Snappy for speed, ZSTD for smaller files (especially for object storage/browser)
- Validate that your chosen representation works end-to-end across your toolchain (Python, JS, DuckDB WASM)