Unicode Data

Unicode version

The library exposes:

inline constexpr std::tuple<std::size_t, std::size_t, std::size_t> unicode_version;

This constant reports the Unicode version of the generated tables checked into the repository.

Generated version: Unicode 17.0.0.
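A minimal sketch of how the version constant might be checked at compile time. The `demo` namespace and the definition inside it are stand-ins so the snippet compiles on its own; the library's real namespace and header are not shown on this page. The element order (major, minor, update) is an assumption inferred from the generated version 17.0.0, and `tables_at_least` is a hypothetical helper, not part of the library.

```cpp
#include <cstddef>
#include <tuple>

// Hypothetical stand-in for the library's constant (assumed order: major, minor, update).
namespace demo {
inline constexpr std::tuple<std::size_t, std::size_t, std::size_t> unicode_version{17, 0, 0};
}

// Hypothetical helper: lexicographic comparison against a required minimum version.
constexpr bool tables_at_least(std::size_t major, std::size_t minor, std::size_t update) {
    return demo::unicode_version >= std::make_tuple(major, minor, update);
}

// Fail the build early if the checked-in tables are older than the code expects.
static_assert(tables_at_least(15, 1, 0), "Unicode tables are older than 15.1.0");
```

Because the constant is `constexpr`, a check like this costs nothing at run time and documents the minimum table version a consumer relies on.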

What is generated

The generated tables cover the Unicode data needed by the library's public behavior, including:

scalar classification predicates
grapheme segmentation properties
Unicode casing tables
normalization decomposition and composition data
full case-fold mappings

The generated constexpr table output lives in unicode_ranges/unicode_tables_constexpr.hpp. unicode_ranges/unicode_tables.hpp remains a thin compatibility wrapper.

Source data

The update pipeline consumes official Unicode Character Database inputs under tools/unicode_data/<version>/.

Important pipeline files include:

UnicodeData.txt
CompositionExclusions.txt
CaseFolding.txt
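To give a sense of what the pipeline has to read, here is a rough sketch of parsing one data line of CaseFolding.txt, whose records have the form `<code>; <status>; <mapping>; # <name>`. The `FoldEntry` struct and `parse_case_folding_line` function are hypothetical names for illustration; this is not the code used by the repository's generator.

```cpp
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for one CaseFolding.txt entry.
struct FoldEntry {
    char32_t code;                  // source code point
    char status;                    // C (common), F (full), S (simple), or T (Turkic)
    std::vector<char32_t> mapping;  // one or more folded code points
};

// Parses a single line; returns nullopt for blank lines and comment-only lines.
// Malformed hex in the first field makes std::stoul throw; a real tool would handle that.
std::optional<FoldEntry> parse_case_folding_line(const std::string& line) {
    // Drop the trailing "# name" comment (or the whole line if it starts with '#').
    const std::string data = line.substr(0, line.find('#'));

    // Split the remainder on ';'.
    std::vector<std::string> fields;
    std::istringstream in(data);
    for (std::string field; std::getline(in, field, ';');) {
        fields.push_back(field);
    }
    if (fields.size() < 3) {
        return std::nullopt;
    }

    // The status is the first non-blank character of the second field.
    const auto status_pos = fields[1].find_first_not_of(" \t");
    if (status_pos == std::string::npos) {
        return std::nullopt;
    }

    FoldEntry entry{};
    entry.code = static_cast<char32_t>(std::stoul(fields[0], nullptr, 16));
    entry.status = fields[1][status_pos];

    // The mapping is a space-separated list of hexadecimal code points.
    std::istringstream mapping(fields[2]);
    mapping >> std::hex;
    for (unsigned long value = 0; mapping >> value;) {
        entry.mapping.push_back(static_cast<char32_t>(value));
    }
    return entry;
}
```

For example, the line `00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S` yields code U+00DF, status `F`, and a two-element mapping.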

Updating Unicode data

Typical workflow:

1. Refresh the raw Unicode data under tools/unicode_data/<version>/.
2. Rerun tools/regenerate_unicode_tables.ps1.
3. Commit the regenerated unicode_ranges/unicode_tables_constexpr.hpp.
4. Update any affected documentation.

Why tables are checked in

The library aims to stay easy to consume without a build-time generator dependency. Checking in the generated tables keeps usage simple for downstream users while keeping the Unicode pipeline explicit and reproducible.

Notes on semantics

Grapheme segmentation follows default Unicode rules.
Unicode casing is locale-independent.
Case folding uses the full Unicode case-fold mappings and does not apply locale-specific tailorings.
Normalization supports NFC, NFD, NFKC, and NFKD.
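The case-folding note can be illustrated with a small sketch. The table below is a hand-written excerpt of three well-known CaseFolding.txt entries plus the ASCII range, not the library's generated data, and `fold_scalar` and `fold_string` are hypothetical names rather than the library's API. It shows two properties of full, untailored folding: a single scalar may expand to several, and `I` folds to `i` rather than to the Turkic dotless ı (U+0131).

```cpp
#include <string>

// Illustrative excerpt only; real folding needs the complete generated tables.
std::u32string fold_scalar(char32_t c) {
    switch (c) {
        case 0x00DF:  // LATIN SMALL LETTER SHARP S, status F: expands to "ss"
            return {U's', U's'};
        case 0x0130:  // LATIN CAPITAL LETTER I WITH DOT ABOVE, status F: i + U+0307
            return {0x0069, 0x0307};
        default:
            break;
    }
    // ASCII capitals use the common (C) mapping; 'I' becomes 'i', never U+0131.
    if (c >= U'A' && c <= U'Z') {
        return std::u32string(1, static_cast<char32_t>(c + 0x20));
    }
    return std::u32string(1, c);  // everything else is left unchanged in this sketch
}

// Folds a whole string by concatenating the per-scalar results.
std::u32string fold_string(const std::u32string& text) {
    std::u32string out;
    for (char32_t c : text) {
        out += fold_scalar(c);
    }
    return out;
}
```

Under this sketch, `STRASSE` and `stra` + U+00DF + `e` fold to the same string, which is the kind of caseless match that simple one-to-one folding cannot produce.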