Design¶
Problem statement¶
The library is meant to solve a recurring problem in C and C++ Unicode handling: too many APIs operate on raw strings with important preconditions left implicit.
Typical pain points are:
- validation is separate from the type that is later passed around
- invalid UTF-8, UTF-16, or UTF-32 can survive too long in ordinary string types
- boundary-sensitive operations rely on callers remembering byte or code-unit rules
- error handling is inconsistent across libraries and easy to lose when composing APIs
unicode_ranges takes a different approach:
- validate once at construction
- encode the result in dedicated lightweight types with clear invariants
- keep those invariants stable across later operations
Once you have a utf8_char, utf16_char, utf32_char, utf8_string_view, utf16_string_view, utf32_string_view, utf8_string, utf16_string, or utf32_string, you are working with validated text, not with raw storage that might or might not be valid.
Core model¶
The library is built around a few explicit rules:
- Unicode scalar values are the semantic model.
- UTF-8, UTF-16, and UTF-32 are all first-class encodings.
- Checked and unchecked APIs are kept distinct.
- Borrowed and owning types are separate.
- Performance matters, especially on ASCII-heavy paths, but not at the expense of Unicode correctness.
Compiled runtime backend¶
The library has a compiled runtime backend. The hot runtime UTF boundary operations live in the compiled unicode_ranges library target and use simdutf as the backend for:
- UTF-8, UTF-16, and UTF-32 validation
- UTF-8, UTF-16, and UTF-32 transcoding on runtime paths
- UTF-8/UTF-16 character counting and selected ASCII-only checks
This is a pragmatic design decision. In the comparative benchmark suite, simdutf has been the strongest raw UTF codec baseline, so the library uses it through its public API instead of re-implementing the same runtime dispatch ladder itself.
That backend choice does not change the core model:
- the public API is still
unicode_ranges - validated types and higher-level algorithms still belong to
unicode_ranges - compile-time and
constexpr-oriented behavior remains implemented locally - the
simdutfdependency is specifically about the runtime hot path for contiguous UTF validation, transcoding, counting, and ASCII scans
Ownership model¶
utf8_string_view/utf16_string_view/utf32_string_viewborrow validated storage.utf8_string/utf16_string/utf32_stringown validated storage.- range-returning APIs such as
chars()andgraphemes()borrow from the source text.
This makes lifetime and mutation rules explicit instead of implicit.
Rvalue aware API¶
The library calls the copy-producing and consuming overload pattern the rvalue aware API. Owning-string methods that return another owning string and expose both const& and && receiver overloads are rvalue aware methods.
auto new_string = my_string.operation(); // keep my_string unchanged
my_string = std::move(my_string).operation(); // consume and assign back
The const& form returns a new owning string and leaves the source unchanged. The && form treats the source as disposable and tries to optimize the operation by reusing the existing buffer when that is safe and useful.
The pattern is useful because it gives one logical operation one name while still supporting both ownership modes:
- it avoids parallel
operation_copy()/operation_inplace()/operation_owned()method families for the same transformation - it keeps ordinary lvalue calls non-mutating, so accidental source modification requires an explicit
std::move - it makes consuming chains efficient because each step receives an owning rvalue and can reuse storage when the operation shape allows it
- it preserves API readability: the operation name describes the text transformation, and the receiver category describes whether the source may be consumed
This is an optimization opportunity, not a no-allocation guarantee. A consuming overload may still allocate when the operation expands the text, allocator or capacity constraints require it, or a fresh result is the safer implementation. After an && call, the moved-from owning string remains valid but has an unspecified value; use the returned object.
Code units, scalars, and graphemes¶
The library exposes all three levels:
- code units: raw UTF-8 bytes, UTF-16 code units, or UTF-32 code points
- scalars: Unicode scalar values
- graphemes: user-perceived characters under default Unicode grapheme-cluster rules
That distinction is why the API surface contains both:
size()char_count()grapheme_count()
and both:
substr(...)grapheme_substr(...)
Borrowed views and owning strings intentionally differ for bound-adjusting APIs. Calling strip_*, trim_*, substr, or grapheme_substr on a view returns another view. Calling the same API on an owning string returns an owning string, and rvalue owning receivers reuse their existing buffer when the operation is just a prefix/suffix bound adjustment. APIs that only make sense as borrowed subview producers, such as split_once, rsplit_once, split_once_at, and grapheme_at, are deleted on owning rvalues to prevent dangling results.
Iteration and encoded storage¶
Encoded strings do not model ranges directly¶
utf8_string_view, utf16_string_view, utf32_string_view, and their owning counterparts intentionally do not expose direct range-based iteration.
That is deliberate: for (auto x : text) is ambiguous for encoded Unicode text. The obvious candidates are:
- raw UTF code units
- Unicode scalar values
- grapheme clusters
The library requires that choice to be explicit instead of silently picking one interpretation.
base() is the raw-storage escape hatch¶
When callers explicitly want the encoded storage, the API exposes base().
That member is named base() rather than bytes() because the same surface is shared across UTF-8, UTF-16, and UTF-32:
- for UTF-8,
base()exposes the underlyingstd::u8string_view/std::u8string - for UTF-16,
base()exposes the underlyingstd::u16string_view/std::u16string - for UTF-32,
base()exposes the underlyingstd::u32string_view/std::u32string
The name is intentionally generic because the concept is "underlying validated storage", not specifically "bytes". For UTF-32, that storage is still just the underlying UTF-32 code-unit sequence; it only happens to line up 1:1 with the represented scalar values.
chars() is explicit scalar iteration¶
chars() and reversed_chars() return dedicated view types over Unicode scalar values.
These views are created through member functions only:
- their constructors are not part of the public user-facing construction path
- the library does not expose a
range_adaptor_closure-style pipe API for them
That keeps scalar iteration discoverable and explicit at the string/view API boundary.
Construction wraps existing validated storage and is cheap. Code that depends on exact construction complexity rather than the documented view semantics is relying on an implementation detail.
The iteration strength depends on the encoding:
- UTF-8 and UTF-16 scalar iteration are forward ranges
- UTF-32 scalar iteration is random-access
graphemes() returns borrowed text slices¶
graphemes() returns a forward view whose elements are encoding-matched string views:
utf8_string_viewslices for UTF-8 textutf16_string_viewslices for UTF-16 textutf32_string_viewslices for UTF-32 text
Each element represents one grapheme cluster under the default Unicode grapheme-cluster rules.
There is intentionally no reversed_graphemes() companion. Default grapheme boundaries are discovered in forward order, so a practical reverse-grapheme implementation has to split forward, store the grapheme slices, and then iterate that storage backwards. That allocation-heavy behavior is too surprising for view creation. Callers that want those semantics can materialize graphemes() into a container and reverse that container explicitly.
Alternatives considered for grapheme iteration¶
Two obvious alternatives were considered and rejected.
Returning a dedicated grapheme value type, analogous to utf8_char, would add another abstraction layer without enough clear payoff. In most places, a borrowed string-view slice already communicates the right semantics, and the _grapheme_utf8, _grapheme_utf16, and _grapheme_utf32 literals cover the "single grapheme value" use case.
Returning owning strings instead of borrowed slices would solve some lifetime problems and would often fit inside small-string optimization for short graphemes, but it would also make the common iteration path heavier. When ownership is actually needed, callers can materialize it explicitly with a simple transform step.
Checked versus unchecked APIs¶
Checked construction validates input and reports structured errors. Unchecked construction exists, but it is intentionally named as such:
from_bytes(...)from_bytes_unchecked(...)char_at(...)char_at_unchecked(...)
The unchecked APIs are there for callers that already proved validity elsewhere and want to skip redundant checks. This is the core "validate once, operate without worry" rule of the library: checked APIs establish the invariant, and unchecked APIs are the explicit escape hatch when that invariant is already known by other means.
ASCII fast paths and Unicode correctness¶
The library exposes both Unicode-aware and ASCII-only classification and transform APIs. That split is intentional:
- Unicode-aware operations remain table-driven and correct across the supported Unicode version.
- ASCII-only operations stay cheap, explicit, and unsurprising.
This is why APIs are named separately, such as:
to_lowercase()versusto_ascii_lowercase()is_alphabetic()versusis_ascii_alphabetic()
constexpr as a design goal¶
Many literals and core operations are meant to remain usable in constant evaluation. That influences the implementation style across:
- validated literal operators
- character decoding and encoding
- Unicode property lookup
- grapheme segmentation
The non-constexpr surface is intentionally the exception, not the rule. The main outliers are operation families that require runtime services outside the library's compile-time tables:
- ICU-backed locale-aware casing and caseless comparison, such as
to_lowercase(locale),to_uppercase(locale),to_titlecase(locale),case_fold(locale), and the locale-aware*_ignore_case(...)helpers - runtime locale availability probing through
is_available_locale(...) - formatting, printing, and streaming integration, including
std::formatterspecializations,std::print/std::printlnusage, andoperator<< - hashing integration through
std::hashspecializations
The compiled runtime backend is also runtime-only internally. That does not make the public checked UTF APIs non-constexpr; when an API is declared constexpr, constant evaluation uses the library's compile-time path instead of the runtime backend.
You do not pay for what you do not use¶
The library tries to keep costs explicit instead of hidden:
- checked and unchecked entry points are separate
- borrowed and owning types are separate
- ASCII-only and Unicode-aware operations are separate
- scalar iteration and grapheme iteration are separate
That separation is deliberate. Callers who need full validation and Unicode semantics can opt into them directly. Callers who already have validated text or only need ASCII behavior do not have to keep paying for heavier paths at every call site.
Scope boundaries¶
Supported:
- validated UTF-8, UTF-16, and UTF-32 text handling
- Unicode predicates
- default grapheme segmentation
- Unicode casing and normalization
- optional ICU-backed locale-aware casing
- formatting / streaming / hashing for library-defined types
Out of scope:
- locale-aware collation
- built-in locale-specific casing tables without ICU
- bidi or layout/shaping engines
- regex engines
- tailored segmentation rules beyond the default grapheme algorithm