unicode_ranges

unicode_ranges is a C++23 library for validated UTF-8, UTF-16, and UTF-32 text.

It is built around a simple idea: Unicode scalar values are the canonical model, while UTF-8, UTF-16, and UTF-32 remain first-class encodings with dedicated APIs. The library gives you validated characters, borrowed views, owning strings, scalar iteration, grapheme iteration, Unicode casing, normalization, formatting support, and conversion between encodings.
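To make the "scalar values are the canonical model" idea concrete, here is a minimal sketch of one scalar value mapped onto UTF-8 and UTF-16 code units. The helper names (`is_scalar_value`, `encode_utf8`, `encode_utf16`, `utf8_units`, `utf16_units`) are hypothetical and are not part of unicode_ranges. The sketch stores UTF-8 units as `std::uint8_t` so that it also compiles on toolchains older than the C++23 the library targets.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration; these are not unicode_ranges APIs.

// A scalar value is any code point up to U+10FFFF outside the surrogate
// range U+D800..U+DFFF.
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Fixed-capacity results: one scalar needs at most 4 UTF-8 units
// or 2 UTF-16 units.
struct utf8_units {
    std::array<std::uint8_t, 4> units{};
    std::size_t size = 0;
};

struct utf16_units {
    std::array<char16_t, 2> units{};
    std::size_t size = 0;
};

// Encode one scalar value as UTF-8. Precondition: is_scalar_value(cp).
constexpr utf8_units encode_utf8(char32_t cp) {
    assert(is_scalar_value(cp));
    utf8_units out{};
    if (cp < 0x80) {
        out.units[0] = static_cast<std::uint8_t>(cp);
        out.size = 1;
    } else if (cp < 0x800) {
        out.units[0] = static_cast<std::uint8_t>(0xC0 | (cp >> 6));
        out.units[1] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        out.size = 2;
    } else if (cp < 0x10000) {
        out.units[0] = static_cast<std::uint8_t>(0xE0 | (cp >> 12));
        out.units[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
        out.units[2] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        out.size = 3;
    } else {
        out.units[0] = static_cast<std::uint8_t>(0xF0 | (cp >> 18));
        out.units[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F));
        out.units[2] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
        out.units[3] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        out.size = 4;
    }
    return out;
}

// Encode one scalar value as UTF-16. Precondition: is_scalar_value(cp).
constexpr utf16_units encode_utf16(char32_t cp) {
    assert(is_scalar_value(cp));
    utf16_units out{};
    if (cp < 0x10000) {
        out.units[0] = static_cast<char16_t>(cp);
        out.size = 1;
    } else {
        // Supplementary planes use a high/low surrogate pair.
        const char32_t v = cp - 0x10000;
        out.units[0] = static_cast<char16_t>(0xD800 | (v >> 10));
        out.units[1] = static_cast<char16_t>(0xDC00 | (v & 0x3FF));
        out.size = 2;
    }
    return out;
}
```

With these helpers, U+1F600 becomes the four UTF-8 units `F0 9F 98 80` and the UTF-16 surrogate pair `D83D DE00`, while in UTF-32 it is the single unit `0001F600`. The scalar value is the same in all three cases; only the code-unit representation changes.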

Why this library exists

Existing C and C++ text handling often starts from raw byte buffers, raw code-unit strings, or APIs with preconditions that are easy to violate and hard to see at the call site. Validation rules, boundary rules, and error handling are frequently left to documentation and convention instead of being carried by the type system.

unicode_ranges exists to push that validation cost to the edge of the program:

validate once, then operate with invariants
represent UTF-8, UTF-16, and UTF-32 text with lightweight dedicated types instead of "maybe valid" raw strings
make invalid states unrepresentable once construction succeeds
keep construction available both at compile time through validated literals and at runtime through checked factories
still expose explicit unchecked fast paths when the caller has already proved validity elsewhere

The design goal is not "maximum abstraction". It is predictable Unicode handling with clear invariants, explicit failure modes, and no repeated worry about whether a value is valid text.
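The "validate once, then operate with invariants" pattern can be sketched without the library. The type below, `checked_utf8_view`, and its members `from_bytes`, `from_bytes_unchecked`, and `scalar_count` are hypothetical names chosen for this illustration and should not be read as the unicode_ranges API. The sketch takes plain `std::string_view` bytes so that it stays portable to C++17; the library itself targets C++23.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical type for illustration; not the unicode_ranges API.
class checked_utf8_view {
    // Strict check: rejects stray continuation bytes, truncated sequences,
    // overlong forms, surrogates, and values above U+10FFFF.
    static constexpr bool is_valid_utf8(std::string_view s) {
        std::size_t i = 0;
        while (i < s.size()) {
            const unsigned b0 = static_cast<unsigned char>(s[i]);
            std::size_t len = 0;
            char32_t cp = 0;
            char32_t min = 0;
            if (b0 < 0x80) { i += 1; continue; }
            else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
            else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
            else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
            else return false;                      // invalid lead byte
            if (s.size() - i < len) return false;   // truncated sequence
            for (std::size_t k = 1; k < len; ++k) {
                const unsigned b = static_cast<unsigned char>(s[i + k]);
                if ((b & 0xC0) != 0x80) return false;
                cp = (cp << 6) | (b & 0x3F);
            }
            if (cp < min || cp > 0x10FFFF) return false;     // overlong or too large
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
            i += len;
        }
        return true;
    }

    explicit constexpr checked_utf8_view(std::string_view b) : bytes_(b) {}

    std::string_view bytes_;

public:
    // Checked factory: validate once at the boundary and fail explicitly.
    static constexpr std::optional<checked_utf8_view> from_bytes(std::string_view bytes) {
        if (!is_valid_utf8(bytes)) return std::nullopt;
        return checked_utf8_view{bytes};
    }

    // Unchecked fast path: the caller has already proved validity.
    // The assert only re-checks that promise in debug builds.
    static constexpr checked_utf8_view from_bytes_unchecked(std::string_view bytes) {
        assert(is_valid_utf8(bytes));
        return checked_utf8_view{bytes};
    }

    constexpr std::string_view bytes() const { return bytes_; }

    // Relies on the invariant: in valid UTF-8, every byte that is not a
    // continuation byte starts exactly one scalar value.
    constexpr std::size_t scalar_count() const {
        std::size_t n = 0;
        for (const char c : bytes_) {
            if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
        }
        return n;
    }
};
```

Because the only ways to obtain a `checked_utf8_view` are the two factories, operations such as `scalar_count` never need to re-validate or handle malformed input. That is the sense in which invalid states become unrepresentable once construction succeeds.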

The public surface is header-first, but the runtime UTF hot paths live in the compiled unicode_ranges library target, built from unicode_ranges.cpp and backed by pinned vendored simdutf (v7.7.0) under third_party/simdutf. Consumers link that library target, or an equivalent library in their own build. There is no separate simdutf include-path step for normal use.

That backend choice is intentional. In the comparative benchmark suite, simdutf has been the strongest baseline for raw UTF validation and transcoding, so unicode_ranges uses it directly for those runtime hot paths, plus selected counting and ASCII-scan paths. The higher-level validated type model and the rest of the Unicode algorithms stay in unicode_ranges itself.

New users: start here

Install And Integrate: how to consume the library from a build system.
Getting Started: include, validate, and use the core types quickly.
Common Tasks: validate input, iterate scalars versus graphemes, normalize, case-map, and convert encodings.
Design: ownership, indexing, boundaries, and what the library treats as a character.
Boundary Encodings: built-in codecs, custom encoder/decoder requirements, generated APIs, and boundary-specific error handling.
Benchmarking: cross-library benchmark policy, comparison rules, toolchain matrix, and benchmark families.
Licensing: repository dual-license model, runtime dependency license notes, and third-party notices.
Stability Policy: the repository root STABILITY.md defines the intended support and compatibility surface while the API stabilizes.
Text Operations: search, split, trim, replace, reverse, and boundary queries.
Casing and Normalization: Unicode casing, case folding, and normalization forms.
Reference: grouped API reference by type family.

Type map

Category	UTF-8	UTF-16	UTF-32
Character	`utf8_char`	`utf16_char`	`utf32_char`
Borrowed text	`utf8_string_view`	`utf16_string_view`	`utf32_string_view`
Owning text	`utf8_string`	`utf16_string`	`utf32_string`
Forward scalar iteration	`views::utf8_view`	`views::utf16_view`	`views::utf32_view`
Reverse scalar iteration	`views::reversed_utf8_view`	`views::reversed_utf16_view`	`views::reversed_utf32_view`
Grapheme iteration	`views::grapheme_cluster_view<char8_t>`	`views::grapheme_cluster_view<char16_t>`	`views::grapheme_cluster_view<char32_t>`
Lossy iteration	`views::lossy_utf8_view<CharT>`	`views::lossy_utf16_view<CharT>`	`views::lossy_utf32_view<CharT>`

Public entry point

#include "unicode_ranges_borrowed.hpp"

Everything public lives in namespace unicode_ranges. Literal operators live in unicode_ranges::literals. PMR owning-string aliases live in unicode_ranges::pmr.

Use unicode_ranges_all.hpp if you want the all-in umbrella, including owning strings and unicode_ranges::characters.

Warning

unicode_ranges::details is implementation detail only. It is not part of the supported public API.

What the library is optimized for

Validated text types instead of raw std::u8string_view / std::u16string_view
Predictable, STL-style APIs with Rust-inspired Unicode ergonomics
Separate checked, unchecked, Unicode-aware, and ASCII-only paths so you do not pay for what you do not use
constexpr-friendly literals and core operations where practical
Table-driven Unicode properties and grapheme segmentation
Fast ASCII paths without degrading Unicode correctness
Optional ICU-backed locale-aware casing when ICU is enabled

What it does not try to cover

Locale-aware collation
Built-in locale-specific casing tables without ICU
Bidirectional layout, shaping, or font/layout work
Regex or full text-search engines
Tailored segmentation beyond default Unicode grapheme rules

Those are deliberate scope boundaries, not accidental omissions.