Skip to content

Error Model

UTF-8 validation

Checked UTF-8 construction reports:

enum class utf8_error_code
{
    invalid_lead_byte,
    truncated_sequence,
    invalid_sequence
};

struct utf8_error
{
    utf8_error_code code{};
    std::size_t first_invalid_byte_index = 0;
};

These are returned by checked APIs such as utf8_string_view::from_bytes(...) and utf8_string::from_bytes(...).

At runtime, the hot UTF-8 validation and checked UTF-8 transcoding paths currently use simdutf underneath. unicode_ranges still presents its own utf8_error and utf8_error_code surface; the runtime backend result is mapped into this library-specific error model before it reaches the caller.

Example:

const std::array<char8_t, 3> invalid{
    static_cast<char8_t>(0xE2),
    static_cast<char8_t>(0x28),
    static_cast<char8_t>(0xA1)
};

auto text = unicode_ranges::utf8_string_view::from_bytes(
    { invalid.data(), invalid.size() });

assert(!text);
assert(text.error().code == unicode_ranges::utf8_error_code::invalid_sequence);
assert(text.error().first_invalid_byte_index == 0);

UTF-16 validation

Checked UTF-16 construction reports:

enum class utf16_error_code
{
    truncated_surrogate_pair,
    invalid_sequence
};

struct utf16_error
{
    utf16_error_code code{};
    std::size_t first_invalid_code_unit_index = 0;
};

These are returned by checked APIs such as utf16_string_view::from_code_units(...) and utf16_string::from_code_units(...).

Checked factories versus unchecked constructors

The library distinguishes between:

  • checked factories that validate incoming raw input and return std::expected
  • unchecked construction APIs that assume the caller already proved validity

Use the unchecked APIs only when validity is already guaranteed by the caller or by a surrounding protocol.

Bounds and semantic errors

Beyond construction-time validation, checked text operations may throw standard exceptions such as std::out_of_range for invalid bounds or boundary misuse. Typical examples include:

  • offsets that are out of range
  • offsets that do not land on a valid character boundary
  • subranges that violate API preconditions

Unchecked variants exist where skipping those checks is intentional.