Boundary Encodings¶

The boundary-encoding API extends the validated UTF string types at external encode and decode boundaries without changing the library's core model.

This page documents the exact public surface. For the higher-level guide to built-in codecs, custom codec requirements, guarantees, and error handling, see Boundary Encodings.

#include <cassert>
#include <array>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    auto decoded = utf8_string::from_encoded<encodings::ascii_strict>(u8"Hello");
    assert(decoded);

    auto strict_bytes = decoded->to_encoded<encodings::ascii_strict>();
    assert(strict_bytes);
    assert(*strict_bytes == u8"Hello");

    std::array<char8_t, 5> bounded{};
    encodings::ascii_strict strict{};
    auto wrote_bounded = decoded->encode_to(std::span<char8_t>{ bounded }, strict);
    assert(wrote_bounded);
    const std::u8string_view bounded_view{ bounded.data(), bounded.size() };
    assert(bounded_view == u8"Hello");

    const std::array<char8_t, 8> windows_input{
        static_cast<char8_t>('P'),
        static_cast<char8_t>('r'),
        static_cast<char8_t>('i'),
        static_cast<char8_t>('c'),
        static_cast<char8_t>('e'),
        static_cast<char8_t>(':'),
        static_cast<char8_t>(' '),
        static_cast<char8_t>(0x80u)
    };
    const auto windows = utf8_string::from_encoded<encodings::windows_1252>(
        std::u8string_view{ windows_input.data(), windows_input.size() });
    assert(windows.base() == u8"Price: \u20AC");
    auto windows_encoded = windows.to_encoded<encodings::windows_1252>();
    assert(windows_encoded);
    const std::u8string expected_windows_bytes{ windows_input.begin(), windows_input.end() };
    assert(*windows_encoded == expected_windows_bytes);

    std::vector<char8_t> lossy_bytes{ static_cast<char8_t>('>') };
    encodings::ascii_lossy lossy{};
    u8"Café"_utf8_sv.to_utf8_owned().encode_append_to(lossy_bytes, lossy);
    assert((lossy_bytes == std::vector<char8_t>{
        static_cast<char8_t>('>'),
        static_cast<char8_t>('C'),
        static_cast<char8_t>('a'),
        static_cast<char8_t>('f'),
        static_cast<char8_t>('?') }));
    assert(lossy.replacement_count == 1);

    return 0;
}

Header And Namespaces¶

Include unicode_ranges_all.hpp for the full surface.
Include unicode_ranges_borrowed.hpp for the lighter borrowed/core umbrella.
The boundary API lives in namespace unicode_ranges.
Built-in codecs currently live in namespace unicode_ranges::encodings.

Core Types¶

Synopsis¶

template <typename Encoder>
struct encoder_traits;

template <typename Decoder>
struct decoder_traits;

template <typename T>
concept encoder = /* exposition only */;

template <typename T>
concept decoder = /* exposition only */;

template <typename Decoder, typename UtfString>
using from_encoded_result = /* UtfString or std::expected<UtfString, decode_error> */;

template <typename Encoder, typename OutputAllocator = std::allocator<typename encoder_traits<Encoder>::code_unit_type>>
using to_encoded_result = /* string or std::expected<string, encode_error> */;

enum class encode_to_error_kind {
    overflow,
    encoding_error
};

template <typename Encoder>
struct encode_to_error;

struct codec_contract_violation : std::logic_error {};

Behavior¶

encoder_traits and decoder_traits normalize codec objects into the surface the library actually calls.
The traits layer always provides flush(...) even when the codec object does not define it.
Decoder code_unit_type must be usable with std::basic_string_view<code_unit_type>.
to_encoded(...) additionally requires an encoded code_unit_type that can back std::basic_string<code_unit_type, std::char_traits<code_unit_type>, ...>.
from_encoded_result and to_encoded_result follow the optional-error-alias rule:
no decode_error / encode_error alias means a direct UTF value or direct encoded string
defining the alias switches the corresponding family to std::expected
encode_to_error_kind::overflow is library-owned bounded-sink exhaustion
encode_to_error_kind::encoding_error wraps the codec's encode_error
codec_contract_violation is reserved for codec bugs when contract checks are enabled

Contract checks¶

UTF8_RANGES_ENABLE_CODEC_CONTRACT_CHECKS defaults to:

1 in debug builds
0 in release builds

When enabled, contract violations throw codec_contract_violation in exception-enabled builds and terminate in no-exception builds. When disabled, violating the codec contract is undefined behavior.

Codec Objects¶

Minimum shape¶

struct my_encoder {
    using code_unit_type = char8_t;

    template <typename Writer>
    void encode_one(char32_t scalar, Writer out);
};

struct my_decoder {
    using code_unit_type = char8_t;

    template <typename Writer>
    std::size_t decode_one(std::basic_string_view<char8_t> input, Writer out);
};

Optional additions¶

using encode_error = /* ... */;
using decode_error = /* ... */;
static constexpr bool allow_implicit_construction = true;

template <typename Writer>
void flush(Writer out);

template <typename Writer>
void encode_from_utf8(utf8_string_view input, Writer out);
template <typename Writer>
void encode_from_utf16(utf16_string_view input, Writer out);
template <typename Writer>
void encode_from_utf32(utf32_string_view input, Writer out);

template <typename Writer>
void decode_to_utf8(std::basic_string_view<char8_t> input, Writer out);
template <typename Writer>
void decode_to_utf16(std::basic_string_view<char8_t> input, Writer out);
template <typename Writer>
void decode_to_utf32(std::basic_string_view<char8_t> input, Writer out);

Behavior¶

Codec objects are real mutable objects. Any runtime state lives on the object itself.
Writer parameters are taken by value. The writer is a cheap non-owning handle over external sink state.
Decoder code_unit_type therefore has to be a valid std::basic_string_view element type.
Encoders intended for to_encoded(...) must use a code_unit_type that is also valid for std::basic_string with std::char_traits<code_unit_type>.
If encode_error or decode_error is defined, that alias must be a non-void type.
If encode_error is defined, encode_one(...), flush(...), and any encode_from_utf* hook the codec provides must return std::expected<..., encode_error> instead of the infallible form.
If decode_error is defined, decode_one(...), flush(...), and any decode_to_utf* hook the codec provides must return std::expected<..., decode_error> instead of the infallible form.
using encode_error = void; and using decode_error = void; are not valid; omit the alias entirely for infallible codecs.
allow_implicit_construction is optional.
if omitted, empty default-constructible codecs are treated as implicitly constructible
explicit false opts out
explicit true opts in even for non-empty codecs
If allow_implicit_construction is true but the codec is not default-constructible, the no-object convenience overloads fail with a static assertion because the library must default-construct a temporary codec internally

Example:

// Infallible encoder: no encode_error alias, hooks return plain success values.
struct ascii_encoder {
    using code_unit_type = char8_t;

    template <typename Writer>
    void encode_one(char32_t scalar, Writer out);

    template <typename Writer>
    void flush(Writer out);
};

// Fallible encoder: encode_error exists and the matching hooks switch to expected.
struct strict_legacy_encoder {
    using code_unit_type = char8_t;

    enum class encode_error {
        unrepresentable_scalar
    };

    template <typename Writer>
    std::expected<void, encode_error> encode_one(char32_t scalar, Writer out);

    template <typename Writer>
    std::expected<void, encode_error> flush(Writer out);

    template <typename Writer>
    std::expected<void, encode_error> encode_from_utf8(utf8_string_view input, Writer out);
};

Whole-input contract¶

encode_from_utf8(...), encode_from_utf16(...), encode_from_utf32(...), decode_to_utf8(...), decode_to_utf16(...), and decode_to_utf32(...) are whole-input operations
on success they must consume the full input view they are given
they cannot silently stop early
the surrounding library algorithm still calls flush(...) afterwards

Primitive decode contract¶

decode_one(...) receives the remaining suffix of the original input after previous successful consumption
the returned consumed count is relative to that suffix
on success, consumed count must be non-zero and must not exceed input.size()
once the input is exhausted, the library skips further decode_one(...) calls and proceeds to flush(...)
flush(...) must also be valid when no prior decode_one(...) call occurred, which naturally happens for empty input

Writer Surface¶

Codecs do not write directly to arbitrary containers or iterators. They receive a library writer handle with this logical contract:

struct Writer {
    using unit_type = /* code unit or char32_t scalar, depending on context */;

    void reserve(std::size_t additional_units) const;
    void push(unit_type unit) const;
    void append(std::span<const unit_type> units) const;

    template <std::ranges::input_range R>
        requires std::convertible_to<std::ranges::range_reference_t<R>, unit_type>
    void append(R&& units) const;
};

Behavior¶

Writer copies share the same underlying destination state.
Writers are call-scoped handles and should not be retained by codecs.
Raw bounded writers report overflow through encode_to(...), not by throwing.
Growable container writers propagate ordinary container exceptions.
For container appenders, the implementation prefers:
resize_and_overwrite(...) for suitable span or sized-range appends
append_range(...)
append(ptr, count) for string-like containers
insert_range(end(), ...)
insert(end(), first, last)
repeated push_back / emplace_back / insert(end(), value) with reserve(...) only on that repeated-push fallback

Owning String Boundary Functions¶

The UTF-8, UTF-16, and UTF-32 owning string types expose structurally parallel boundary APIs. The synopsis below uses the UTF-8 family explicitly.

Decode into validated UTF¶

template <typename Decoder>
static constexpr auto from_encoded(
    std::basic_string_view<typename decoder_traits<Decoder>::code_unit_type> input,
    Decoder& decoder,
    const Allocator& alloc = Allocator())
    -> from_encoded_result<Decoder, basic_utf8_string>;

template <typename Decoder>
    requires decoder_traits<Decoder>::allow_implicit_construction_requested
static constexpr auto from_encoded(
    std::basic_string_view<typename decoder_traits<Decoder>::code_unit_type> input,
    const Allocator& alloc = Allocator())
    -> from_encoded_result<Decoder, basic_utf8_string>;

Encode into an owned encoded string¶

template <typename Encoder, typename OutputAllocator = std::allocator<typename encoder_traits<Encoder>::code_unit_type>>
constexpr auto to_encoded(
    Encoder& encoder,
    const OutputAllocator& alloc = OutputAllocator()) const
    -> to_encoded_result<Encoder, OutputAllocator>;

template <typename Encoder, typename OutputAllocator = std::allocator<typename encoder_traits<Encoder>::code_unit_type>>
    requires encoder_traits<Encoder>::allow_implicit_construction_requested
constexpr auto to_encoded(
    const OutputAllocator& alloc = OutputAllocator()) const
    -> to_encoded_result<Encoder, OutputAllocator>;

Encode into a bounded raw sink¶

template <typename Encoder, typename Out>
    requires std::ranges::range<Out>
          && std::ranges::output_range<Out, typename encoder_traits<Encoder>::code_unit_type>
constexpr auto encode_to(Out&& out, Encoder& encoder) const
    -> std::expected<void, encode_to_error<Encoder>>;

template <typename Encoder, typename Out>
    requires encoder_traits<Encoder>::allow_implicit_construction_requested
          && std::ranges::range<Out>
          && std::ranges::output_range<Out, typename encoder_traits<Encoder>::code_unit_type>
constexpr auto encode_to(Out&& out) const
    -> std::expected<void, encode_to_error<Encoder>>;

Append to a growable sequence-like container¶

template <typename Encoder, typename Container>
constexpr auto encode_append_to(Container& container, Encoder& encoder) const
    -> /* void or std::expected<void, encode_error> */;

template <typename Encoder, typename Container>
    requires encoder_traits<Encoder>::allow_implicit_construction_requested
constexpr auto encode_append_to(Container& container) const
    -> /* void or std::expected<void, encode_error> */;

Behavior¶

from_encoded(...) always materializes an owned validated UTF string
to_encoded(...) builds a growable encoded string result
encode_to(...) targets bounded raw sinks such as iterator/sentinel-backed outputs and reports overflow through encode_to_error<Encoder>
encode_append_to(...) appends after the destination container's existing contents and never reports overflow
encode_append_to(...) only participates for sequence-like append containers whose value_type can be constructed from the encoder's code_unit_type
partial output written before overflow or codec failure is preserved
if a growable destination container throws while appending, that exception propagates normally

Built-in Codecs¶

The built-in single-byte codecs follow documented source mappings rather than ad hoc byte tables. The current built-ins use either:

direct identity mapping over U+0000..U+00FF
or a published WHATWG index file

`encodings::ascii_strict`¶

code_unit_type = char8_t
defines both encode_error and decode_error
encodes and decodes only ASCII
reports non-ASCII scalars or bytes as ordinary codec errors
enables implicit construction

`encodings::ascii_lossy`¶

code_unit_type = char8_t
does not define encode_error or decode_error
replaces unrepresentable scalars and invalid bytes with replacement output
tracks replacement counts on the codec object
does not opt into implicit construction, because callers typically care about the mutated codec object afterwards

`encodings::iso_8859_1`¶

code_unit_type = char8_t
defines encode_error, but decoding is infallible
maps bytes 0x00..0xFF directly to Unicode U+0000..U+00FF
encodes only scalars in the Latin-1 range and reports other scalars as ordinary encode errors
enables implicit construction

Source mapping: - direct Latin-1 identity mapping

`encodings::iso_8859_15`¶

code_unit_type = char8_t
defines encode_error, but decoding is infallible
follows the WHATWG ISO-8859-15 index
keeps the Latin-1 shape, but remaps 0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, and 0xBE
encodes only scalars in the ISO-8859-15 repertoire and reports other scalars as ordinary encode errors
enables implicit construction

Source table: - WHATWG index: https://encoding.spec.whatwg.org/index-iso-8859-15.txt

`encodings::windows_1251`¶

code_unit_type = char8_t
defines encode_error, but decoding is infallible
follows the WHATWG Windows-1251 index
covers the Windows Cyrillic repertoire, including the WHATWG-preserved control and punctuation slots in the 0x80..0x9F range
encodes ASCII and the Windows-1251 repertoire, and reports other scalars as ordinary encode errors
enables implicit construction

Source table: - WHATWG index: https://encoding.spec.whatwg.org/index-windows-1251.txt

`encodings::windows_1252`¶

code_unit_type = char8_t
defines encode_error, but decoding is infallible
follows the WHATWG Windows-1252 index, not the older undefined-hole vendor mapping
encodes ASCII and the Windows-1252 repertoire, and reports other scalars as ordinary encode errors
decodes bytes 0x81, 0x8D, 0x8F, 0x90, and 0x9D to the corresponding C1 control code points, matching WHATWG
enables implicit construction

Source table: - WHATWG index: https://encoding.spec.whatwg.org/index-windows-1252.txt

Boundary Encodings¶

Header And Namespaces¶

Core Types¶

Synopsis¶

Behavior¶

Contract checks¶

Codec Objects¶

Minimum shape¶

Optional additions¶

Behavior¶

Whole-input contract¶

Primitive decode contract¶

Writer Surface¶

Behavior¶

Owning String Boundary Functions¶

Decode into validated UTF¶

Encode into an owned encoded string¶

Encode into a bounded raw sink¶

Append to a growable sequence-like container¶

Behavior¶

Built-in Codecs¶

encodings::ascii_strict¶

encodings::ascii_lossy¶

encodings::iso_8859_1¶

encodings::iso_8859_15¶

encodings::windows_1251¶

encodings::windows_1252¶

`encodings::ascii_strict`¶

`encodings::ascii_lossy`¶

`encodings::iso_8859_1`¶

`encodings::iso_8859_15`¶

`encodings::windows_1251`¶

`encodings::windows_1252`¶