Skip to content

Design

Problem statement

The library is meant to solve a recurring problem in C and C++ Unicode handling: too many APIs operate on raw strings with important preconditions left implicit.

Typical pain points are:

  • validation is separate from the type that is later passed around
  • invalid UTF-8, UTF-16, or UTF-32 can survive too long in ordinary string types
  • boundary-sensitive operations rely on callers remembering byte or code-unit rules
  • error handling is inconsistent across libraries and easy to lose when composing APIs

unicode_ranges takes a different approach:

  • validate once at construction
  • encode the result in dedicated lightweight types with clear invariants
  • keep those invariants stable across later operations

Once you have a utf8_char, utf16_char, utf32_char, utf8_string_view, utf16_string_view, utf32_string_view, utf8_string, utf16_string, or utf32_string, you are working with validated text, not with raw storage that might or might not be valid.

Core model

The library is built around a few explicit rules:

  • Unicode scalar values are the semantic model.
  • UTF-8, UTF-16, and UTF-32 are all first-class encodings.
  • Checked and unchecked APIs are kept distinct.
  • Borrowed and owning types are separate.
  • Performance matters, especially on ASCII-heavy paths, but not at the expense of Unicode correctness.

Compiled runtime backend

The library now has a compiled runtime backend. The hot runtime UTF boundary operations live in the compiled unicode_ranges library target and use simdutf as the backend for:

  • UTF-8 validation
  • UTF-8 -> UTF-16 transcoding
  • UTF-8 -> UTF-32 transcoding

This is a pragmatic design decision. In the comparative benchmark suite, simdutf has been the strongest raw UTF codec baseline, so the library now uses it through its public API instead of re-implementing the same runtime dispatch ladder itself.

That backend choice does not change the core model:

  • the public API is still unicode_ranges
  • validated types and higher-level algorithms still belong to unicode_ranges
  • compile-time and constexpr-oriented behavior remains implemented locally
  • the simdutf dependency is specifically about the runtime hot path for contiguous UTF validation/transcoding

Ownership model

  • utf8_string_view / utf16_string_view / utf32_string_view borrow validated storage.
  • utf8_string / utf16_string / utf32_string own validated storage.
  • range-returning APIs such as chars() and graphemes() borrow from the source text.

This makes lifetime and mutation rules explicit instead of implicit.

Code units, scalars, and graphemes

The library exposes all three levels:

  • code units: raw UTF-8 bytes, UTF-16 code units, or UTF-32 code points
  • scalars: Unicode scalar values
  • graphemes: user-perceived characters under default Unicode grapheme-cluster rules

That distinction is why the API surface contains both:

  • size()
  • char_count()
  • grapheme_count()

and both:

  • substr(...)
  • grapheme_substr(...)

Iteration and encoded storage

Encoded strings do not model ranges directly

utf8_string_view, utf16_string_view, utf32_string_view, and their owning counterparts intentionally do not expose direct range-based iteration.

That is deliberate: for (auto x : text) is ambiguous for encoded Unicode text. The obvious candidates are:

  • raw UTF code units
  • Unicode scalar values
  • grapheme clusters

The library requires that choice to be explicit instead of silently picking one interpretation.

base() is the raw-storage escape hatch

When callers explicitly want the encoded storage, the API exposes base().

That member is named base() rather than bytes() because the same surface is shared across UTF-8, UTF-16, and UTF-32:

  • for UTF-8, base() exposes the underlying std::u8string_view / std::u8string
  • for UTF-16, base() exposes the underlying std::u16string_view / std::u16string
  • for UTF-32, base() exposes the underlying std::u32string_view / std::u32string

The name is intentionally generic because the concept is "underlying validated storage", not specifically "bytes". For UTF-32, that storage is still just the underlying UTF-32 code-unit sequence; it only happens to line up 1:1 with the represented scalar values.

chars() is explicit scalar iteration

chars() and reversed_chars() return dedicated view types over Unicode scalar values.

These views are created through member functions only:

  • their constructors are not part of the public user-facing construction path
  • the library does not expose a range_adaptor_closure-style pipe API for them

That keeps scalar iteration discoverable and explicit at the string/view API boundary.

Construction is currently O(1) because the returned view wraps existing validated storage, but that is current behavior rather than a promised long-term complexity guarantee.

The iteration strength depends on the encoding:

  • UTF-8 and UTF-16 scalar iteration are forward ranges
  • UTF-32 scalar iteration is random-access

graphemes() returns borrowed text slices

graphemes() returns a forward view whose elements are encoding-matched string views:

  • utf8_string_view slices for UTF-8 text
  • utf16_string_view slices for UTF-16 text
  • utf32_string_view slices for UTF-32 text

Each element represents one grapheme cluster under the default Unicode grapheme-cluster rules.

There is intentionally no reversed_graphemes() companion today. Reverse grapheme iteration needs different machinery and tradeoffs, and the library does not currently want to standardize that surface prematurely.

Alternatives considered for grapheme iteration

Two obvious alternatives were considered and rejected.

Returning a dedicated grapheme value type, analogous to utf8_char, would add another abstraction layer without enough clear payoff. In most places, a borrowed string-view slice already communicates the right semantics, and the _grapheme_utf8, _grapheme_utf16, and _grapheme_utf32 literals cover the "single grapheme value" use case.

Returning owning strings instead of borrowed slices would solve some lifetime problems and would often fit inside small-string optimization for short graphemes, but it would also make the common iteration path heavier. When ownership is actually needed, callers can materialize it explicitly with a simple transform step.

Checked versus unchecked APIs

Checked construction validates input and reports structured errors. Unchecked construction exists, but it is intentionally named as such:

  • from_bytes(...)
  • from_bytes_unchecked(...)
  • char_at(...)
  • char_at_unchecked(...)

The unchecked APIs are there for callers that already proved validity elsewhere and want to skip redundant checks. This is the core "validate once, operate without worry" rule of the library: checked APIs establish the invariant, and unchecked APIs are the explicit escape hatch when that invariant is already known by other means.

ASCII fast paths and Unicode correctness

The library exposes both Unicode-aware and ASCII-only classification and transform APIs. That split is intentional:

  • Unicode-aware operations remain table-driven and correct across the supported Unicode version.
  • ASCII-only operations stay cheap, explicit, and unsurprising.

This is why APIs are named separately, such as:

  • to_lowercase() versus to_ascii_lowercase()
  • is_alphabetic() versus is_ascii_alphabetic()

constexpr as a design goal

Many literals and core operations are meant to remain usable in constant evaluation. That influences the implementation style across:

  • validated literal operators
  • character decoding and encoding
  • Unicode property lookup
  • grapheme segmentation

Not every operation is constexpr, but it is a deliberate design target rather than an accidental bonus.

You do not pay for what you do not use

The library tries to keep costs explicit instead of hidden:

  • checked and unchecked entry points are separate
  • borrowed and owning types are separate
  • ASCII-only and Unicode-aware operations are separate
  • scalar iteration and grapheme iteration are separate

That separation is deliberate. Callers who need full validation and Unicode semantics can opt into them directly. Callers who already have validated text or only need ASCII behavior do not have to keep paying for heavier paths at every call site.

Scope boundaries

Supported:

  • validated UTF-8, UTF-16, and UTF-32 text handling
  • Unicode predicates
  • default grapheme segmentation
  • Unicode casing and normalization
  • optional ICU-backed locale-aware casing
  • formatting / streaming / hashing for library-defined types

Out of scope:

  • locale-aware collation
  • built-in locale-specific casing tables without ICU
  • bidi or layout/shaping engines
  • regex engines
  • tailored segmentation rules beyond the default grapheme algorithm