Boundary Encodings¶

unicode_ranges stays UTF-centric internally, but it can now encode to and decode from external non-UTF formats at the boundary.

Use this layer when text must cross an interface that speaks something other than validated UTF-8, UTF-16, or UTF-32, or when encoded output must be written into a caller-supplied destination:

legacy single-byte encodings
protocol-specific byte formats
application-defined encodings
bounded output buffers
growable byte containers

The core rule stays the same:

validated UTF types are still the semantic center of the library
arbitrary encodings live at ingress and egress only

For the exact public API surface, see the Boundary Encodings reference.

What This Layer Gives You¶

built-in codecs for common boundary formats
a way to define your own Encoder and Decoder types without specializing library traits
generated from_encoded(...), to_encoded(...), encode_to(...), and encode_append_to(...) entry points on the owning UTF string types
a uniform error model for codec failures versus bounded-output overflow
preserved UTF invariants on the library side

Built-in Codecs¶

The library currently ships these boundary codecs:

| Codec | Direction | Behavior |
| --- | --- | --- |
| `encodings::ascii_strict` | encode + decode | strict ASCII only; non-ASCII input is an ordinary codec error |
| `encodings::ascii_lossy` | encode + decode | replaces unrepresentable scalars and invalid bytes; tracks replacement count on the codec object |
| `encodings::iso_8859_1` | encode + decode | strict ISO-8859-1 / Latin-1 mapping; encode is fallible, decode is total |
| `encodings::iso_8859_15` | encode + decode | strict ISO-8859-15 / Latin-9 mapping; encode is fallible, decode is total |
| `encodings::windows_1251` | encode + decode | strict WHATWG Windows-1251 mapping; encode is fallible, decode is total |
| `encodings::windows_1252` | encode + decode | strict WHATWG Windows-1252 mapping; encode is fallible, decode is total |

Use built-ins when they already match your boundary contract. Define custom codecs when you need different replacement behavior, diagnostics, protocol rules, or a different encoding altogether.

The built-in single-byte codecs use documented source mappings. iso_8859_1 is the direct Latin-1 identity mapping. iso_8859_15, windows_1251, and windows_1252 follow the corresponding WHATWG index files.
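The Latin-1 identity mapping can be illustrated without the library. The following is a minimal sketch only: `latin1_decode_byte` and `latin1_encode_scalar` are hypothetical helper names, not part of unicode_ranges, and `std::optional` stands in for the library's error reporting.

```cpp
#include <optional>

// Hypothetical helpers for illustration; not part of unicode_ranges.

// Decode is total: every byte 0x00..0xFF maps to the scalar with the same value.
constexpr char32_t latin1_decode_byte(unsigned char byte)
{
    return static_cast<char32_t>(byte);
}

// Encode is fallible: only scalars up to U+00FF have a Latin-1 byte.
constexpr std::optional<unsigned char> latin1_encode_scalar(char32_t scalar)
{
    if (scalar > 0xFFu)
    {
        return std::nullopt;
    }
    return static_cast<unsigned char>(scalar);
}
```

This is why the table lists Latin-1 decode as total and encode as fallible: U+00E9 round-trips through byte 0xE9, while U+20AC has no Latin-1 byte at all.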

Defining Your Own Codecs¶

You define real codec objects. The library detects their hooks through encoder_traits and decoder_traits, but users do not specialize those traits directly.

That means:

codec state lives on the object itself
stateless codecs can stay empty and cheap
stateful codecs can keep counters, buffers, runtime configuration, or diagnostics on the object

Encoder Requirements¶

A custom encoder must define:

using code_unit_type = ...;
encode_one(char32_t scalar, Writer out)

Optional encoder additions are:

using encode_error = ...;
static constexpr bool allow_implicit_construction = true;
flush(Writer out)
encode_from_utf8(...)
encode_from_utf16(...)
encode_from_utf32(...)

Semantic requirements:

code_unit_type must be usable with std::basic_string_view<code_unit_type>
if the encoder participates in to_encoded(...), code_unit_type must also be usable with std::basic_string<code_unit_type, std::char_traits<code_unit_type>, ...>
encode_one(...) receives one Unicode scalar and writes encoded code units through the writer handle
whole-input hooks such as encode_from_utf8(...) must consume the entire input view on success
the library still calls flush(...) after a successful whole-input hook

Decoder Requirements¶

A custom decoder must define:

using code_unit_type = ...;
decode_one(std::basic_string_view<code_unit_type> input, Writer out)

Optional decoder additions are:

using decode_error = ...;
static constexpr bool allow_implicit_construction = true;
flush(Writer out)
decode_to_utf8(...)
decode_to_utf16(...)
decode_to_utf32(...)

Semantic requirements:

code_unit_type must be usable with std::basic_string_view<code_unit_type>
decode_one(...) receives the not-yet-consumed suffix of the original encoded input, that is, everything after the code units that earlier calls reported as consumed
the success return value is the number of code units consumed from that suffix
successful decode_one(...) calls must consume at least one code unit and no more than input.size()
once the input is exhausted, the library stops calling decode_one(...) and proceeds to flush(...)
flush(...) must be callable even if no prior decode_one(...) call happened, which naturally occurs for empty input
whole-input hooks such as decode_to_utf8(...) must consume the entire input view on success
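The decode_one(...) and flush(...) rules above can be pictured as a driver loop. The sketch below is an illustration under stated assumptions, not the library's implementation: `scalar_writer`, `byte_passthrough_decoder`, and `drive_decode` are hypothetical names, the decoder is infallible (no decode_error alias), and plain `char` is used as the code unit so the sketch compiles before C++20.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical writer handle: copies share the same underlying sink.
struct scalar_writer
{
    std::vector<char32_t>* sink;

    void push(char32_t scalar) const { sink->push_back(scalar); }
};

// Hypothetical infallible decoder: each input byte becomes one scalar.
struct byte_passthrough_decoder
{
    using code_unit_type = char;

    template <typename Writer>
    std::size_t decode_one(std::basic_string_view<char> input, Writer out)
    {
        // The driver never calls decode_one with empty input, so front() is safe.
        out.push(static_cast<char32_t>(static_cast<unsigned char>(input.front())));
        return 1; // number of code units consumed from this suffix
    }

    template <typename Writer>
    void flush(Writer) {}
};

// Illustrative driver: hands the remaining suffix to decode_one until the
// input is exhausted, then calls flush exactly once.
template <typename Decoder>
std::vector<char32_t> drive_decode(Decoder& decoder, std::string_view input)
{
    std::vector<char32_t> scalars;
    scalar_writer out{ &scalars };

    while (!input.empty())
    {
        const std::size_t consumed = decoder.decode_one(input, out);
        // Contract: 1 <= consumed <= input.size().
        input.remove_prefix(consumed);
    }

    decoder.flush(out); // also reached when the input was empty
    return scalars;
}
```

With empty input the loop body never runs, which is the case where flush(...) is called without any prior decode_one(...) call.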

Fallible Versus Infallible Codecs¶

Whether a codec is fallible is declared by the presence of the error alias:

define encode_error to make the encoder fallible
define decode_error to make the decoder fallible

Once the alias exists, the matching hooks must switch to std::expected return types:

encode_one(...)
decode_one(...)
flush(...)
any whole-input bulk hook you provide

Additional rules:

encode_error and decode_error must be non-void types
using encode_error = void; and using decode_error = void; are not valid; omit the alias entirely for infallible codecs
the error type of each std::expected must be the codec's own alias: std::expected<..., encode_error> for encode_one(...), flush(...), and any encode_from_utf* hook, and std::expected<..., decode_error> for decode_one(...), flush(...), and any decode_to_utf* hook

If the alias is absent, those hooks use the direct success form instead.

Example:

#include <expected>

// Infallible encoder: no encode_error alias, plain success returns.
struct ascii_encoder {
    using code_unit_type = char8_t;

    template <typename Writer>
    void encode_one(char32_t scalar, Writer out);

    template <typename Writer>
    void flush(Writer out);
};

// Fallible encoder: encode_error exists, matching hooks return expected.
struct strict_legacy_encoder {
    using code_unit_type = char8_t;

    enum class encode_error {
        unrepresentable_scalar
    };

    template <typename Writer>
    std::expected<void, encode_error> encode_one(char32_t scalar, Writer out);

    template <typename Writer>
    std::expected<void, encode_error> flush(Writer out);
};
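Detecting fallibility from the presence of a nested alias is a standard use of the C++ detection idiom. The sketch below shows one way such a check could be written; `has_encode_error` and the two stand-in codec types are hypothetical, and nothing here is claimed to match how encoder_traits is actually implemented.

```cpp
#include <type_traits>

// Hypothetical trait: true when T declares a nested encode_error type.
template <typename T, typename = void>
struct has_encode_error : std::false_type {};

template <typename T>
struct has_encode_error<T, std::void_t<typename T::encode_error>> : std::true_type {};

// Stand-in codecs mirroring the two shapes: alias absent versus alias present.
struct infallible_encoder
{
    using code_unit_type = char;
};

struct fallible_encoder
{
    using code_unit_type = char;

    enum class encode_error
    {
        unrepresentable_scalar
    };
};

static_assert(!has_encode_error<infallible_encoder>::value);
static_assert(has_encode_error<fallible_encoder>::value);
```

A check of this kind only sees whether the alias exists, which is consistent with the rule that infallible codecs omit the alias entirely instead of setting it to void.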

Implicit Construction¶

The convenience overloads that do not take an explicit codec object are controlled by allow_implicit_construction.

Rules:

if allow_implicit_construction is omitted, an empty default-constructible codec is still treated as implicitly constructible
explicit false opts out even for empty default-constructible codecs
explicit true opts in even for non-empty codecs
if allow_implicit_construction is true but the codec is not default-constructible, the convenience overloads fail with a static assertion because the library must default-construct a temporary codec object internally

This matters because the generated no-object APIs internally create a temporary codec object and do not return it to you afterwards.

That is appropriate for stateless codecs. It is usually the wrong choice for stateful codecs whose counters or diagnostics matter after the operation.
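The first three rules can be expressed as a small compile-time trait. This is a sketch of one reading of those rules, not the library's implementation: `implicit_construction_allowed` and the four sample codec types are hypothetical names, and the static assertion for non-default-constructible codecs is not modeled.

```cpp
#include <cstddef>
#include <type_traits>

// Default rule (flag omitted): only empty, default-constructible codecs qualify.
template <typename Codec, typename = void>
struct implicit_construction_allowed
    : std::bool_constant<std::is_empty_v<Codec> && std::is_default_constructible_v<Codec>>
{};

// An explicit allow_implicit_construction member overrides the default rule.
template <typename Codec>
struct implicit_construction_allowed<
    Codec, std::void_t<decltype(Codec::allow_implicit_construction)>>
    : std::bool_constant<Codec::allow_implicit_construction>
{};

struct empty_codec {};

struct opted_out_codec
{
    static constexpr bool allow_implicit_construction = false;
};

struct counting_codec
{
    std::size_t replacement_count = 0;
};

struct opted_in_counting_codec
{
    static constexpr bool allow_implicit_construction = true;
    std::size_t replacement_count = 0;
};

static_assert(implicit_construction_allowed<empty_codec>::value);
static_assert(!implicit_construction_allowed<opted_out_codec>::value);
static_assert(!implicit_construction_allowed<counting_codec>::value);
static_assert(implicit_construction_allowed<opted_in_counting_codec>::value);
```

The last case is the one to be careful with: opting a counting codec into implicit construction compiles, but the counter lives on a temporary that the caller never sees.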

The Writer Contract¶

Codec hooks do not write directly to arbitrary iterators or containers. They receive a small writer handle instead.

Important properties:

writers are passed by value
copying a writer copies only the handle, not the destination
writer copies still talk to the same underlying sink
writers are call-scoped and should not be retained by codec objects

Writers provide three operations:

reserve(additional_units)
push(unit)
append(units)

The same writer model is used for:

encoded code units on the encode side
Unicode scalar output on the primitive decode side
UTF code-unit output inside the library's bulk decode paths
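To make the handle semantics concrete, here is a minimal sketch of a writer over a growable sink. `vector_writer`, `write_greeting`, and `writer_demo` are hypothetical names, and the exact parameter types of the library's reserve, push, and append are assumptions; only the three operation names come from the list above.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical writer handle; not the library's writer type.
template <typename Unit>
struct vector_writer
{
    std::vector<Unit>* sink; // non-owning: copying the handle never copies the sink

    void reserve(std::size_t additional_units) const
    {
        sink->reserve(sink->size() + additional_units);
    }

    void push(Unit unit) const { sink->push_back(unit); }

    void append(std::basic_string_view<Unit> units) const
    {
        sink->insert(sink->end(), units.begin(), units.end());
    }
};

// A hook-shaped function that takes the writer by value, as codec hooks do.
template <typename Writer>
void write_greeting(Writer out)
{
    out.reserve(3);
    out.push('H');
    out.append("i!");
}

// Writes through the original handle and through a copy of it.
inline std::vector<char> writer_demo()
{
    std::vector<char> bytes;
    vector_writer<char> out{ &bytes };
    vector_writer<char> copy = out; // second handle, same sink

    write_greeting(out);
    copy.push('?');
    return bytes; // holds 'H', 'i', '!', '?'
}
```

Because the handle is only a pointer to the sink, passing it by value is cheap, and writes made through any copy land in the same destination.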

Generated Owning-String APIs¶

After you define a decoder, the owning UTF string types gain these entry points:

utf8_string::from_encoded<Decoder>(...)
utf16_string::from_encoded<Decoder>(...)
utf32_string::from_encoded<Decoder>(...)

After you define an encoder, the owning UTF string types gain these entry points:

text.to_encoded<Encoder>(...)
text.encode_to<Encoder>(...)
text.encode_append_to<Encoder>(...)

In practice that means:

decode always materializes an owned validated UTF string
to_encoded(...) builds an owned encoded string
encode_to(...) targets a bounded raw output range and reports overflow explicitly
encode_append_to(...) appends to an existing growable sequence-like container and never reports overflow

This example shows a custom strict ASCII encoder/decoder pair and the generated methods it unlocks:

#include <array>
#include <cstddef>
#include <cstdint>
#include <expected>
#include <span>
#include <string_view>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;

struct strict_ascii_encoder
{
    using code_unit_type = char8_t;

    enum class encode_error
    {
        unrepresentable_scalar
    };

    static constexpr bool allow_implicit_construction = true;

    template <typename Writer>
    constexpr auto encode_one(char32_t scalar, Writer out) -> std::expected<void, encode_error>
    {
        // Strict mode: reject any scalar outside the 7-bit ASCII range.
        if (scalar > 0x7Fu)
        {
            return std::unexpected(encode_error::unrepresentable_scalar);
        }

        out.push(static_cast<char8_t>(scalar));
        return {};
    }
};

struct strict_ascii_decoder
{
    using code_unit_type = char8_t;

    enum class decode_error
    {
        invalid_input
    };

    static constexpr bool allow_implicit_construction = true;

    template <typename Writer>
    constexpr auto decode_one(std::basic_string_view<char8_t> input, Writer out)
        -> std::expected<std::size_t, decode_error>
    {
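        // The library stops calling decode_one once the input is exhausted,
        // so input is never empty here and front() is safe.
        const auto byte = static_cast<std::uint8_t>(input.front());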
        if (byte > 0x7Fu)
        {
            return std::unexpected(decode_error::invalid_input);
        }

        out.push(static_cast<char32_t>(byte));
        return 1;
    }
};

int main()
{
    auto decoded = utf8_string::from_encoded<strict_ascii_decoder>(u8"Hello");
    if (!decoded || decoded->base() != u8"Hello")
    {
        return 1;
    }

    auto owned_bytes = decoded->to_encoded<strict_ascii_encoder>();
    if (!owned_bytes || *owned_bytes != u8"Hello")
    {
        return 1;
    }

    std::array<char8_t, 5> bounded{};
    auto bounded_result = decoded->encode_to<strict_ascii_encoder>(std::span<char8_t>{ bounded });
    if (!bounded_result || std::u8string_view{ bounded.data(), bounded.size() } != u8"Hello")
    {
        return 1;
    }

    std::vector<char8_t> appended{ static_cast<char8_t>('>') };
    auto append_result = decoded->encode_append_to<strict_ascii_encoder>(appended);
    if (!append_result || appended != std::vector<char8_t>{
        static_cast<char8_t>('>'),
        static_cast<char8_t>('H'),
        static_cast<char8_t>('e'),
        static_cast<char8_t>('l'),
        static_cast<char8_t>('l'),
        static_cast<char8_t>('o') })
    {
        return 1;
    }

    return 0;
}

Error Handling And Guarantees¶

Boundary encoding has two separate failure classes:

codec-defined failures
bounded-output overflow on the raw encode_to(...) path

Codec Failures¶

Codec failures belong to the codec object itself:

strict rejection of unrepresentable scalars
invalid input sequences
protocol-specific decode failures

These surface through encode_error or decode_error.

Return-type rules:

from_encoded(...) returns the UTF value directly for infallible decoders, or std::expected<UTF, decode_error> for fallible decoders
to_encoded(...) returns the encoded string directly for infallible encoders, or std::expected<string, encode_error> for fallible encoders
encode_append_to(...) returns void for infallible encoders, or std::expected<void, encode_error> for fallible encoders

Bounded Output Overflow¶

Overflow is not a codec error.

It belongs to the library-owned bounded writer used by encode_to(...).

That path always reports:

std::expected<void, encode_to_error<Encoder>>

with:

encode_to_error_kind::overflow for destination exhaustion
encode_to_error_kind::encoding_error for codec-defined encode failures

Partial Output Preservation¶

When output has already been written, it is preserved:

on bounded overflow
on codec failure after writing a prefix
on growable-container writes before a container exception escapes

The library does not roll already-written output back.

Decoder Validation Guarantee¶

Custom decoders are allowed to write through bulk UTF paths such as decode_to_utf8(...), but the library still validates the final UTF result before constructing the public owning UTF type.

That keeps the invariant intact:

successful from_encoded(...) still returns validated UTF

Contract Violations¶

The codec hooks are a contract.

Examples of contract violations:

reporting success from decode_one(...) with 0 consumed input
emitting an invalid Unicode scalar
writing malformed UTF through a bulk UTF hook

When UTF8_RANGES_ENABLE_CODEC_CONTRACT_CHECKS is enabled:

exception-enabled builds throw codec_contract_violation
no-exception builds terminate

When contract checks are disabled, violating the codec contract is undefined behavior.
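Two of the listed violations reduce to simple predicates. The sketch below shows what such checks could look like; `is_valid_consumed` and `is_unicode_scalar` are hypothetical helper names, and the library's own checks and its codec_contract_violation type are not reproduced here.

```cpp
#include <cstddef>

// A successful decode_one must consume at least one code unit and no more
// than the suffix it was given.
constexpr bool is_valid_consumed(std::size_t consumed, std::size_t remaining)
{
    return consumed >= 1 && consumed <= remaining;
}

// A Unicode scalar value is any code point up to U+10FFFF that is not a
// surrogate (U+D800..U+DFFF).
constexpr bool is_unicode_scalar(char32_t value)
{
    return value <= 0x10FFFFu && (value < 0xD800u || value > 0xDFFFu);
}

static_assert(is_valid_consumed(1, 4));
static_assert(!is_valid_consumed(0, 4));
static_assert(!is_valid_consumed(5, 4));
static_assert(is_unicode_scalar(0x20AC));
static_assert(!is_unicode_scalar(0xD800));
static_assert(!is_unicode_scalar(0x110000));
```

Running equivalent checks inside your own codec's tests is a cheap way to catch violations before they reach a build where contract checks are disabled.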

Native Codecs In Use¶

The built-ins serve two purposes: they are usable as shipped, and they demonstrate the supported codec shapes.

This example uses all currently supported native codecs:

#include <array>
#include <string>
#include <string_view>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    // ASCII text followed by byte 0x80, which Windows-1252 maps to the euro sign (U+20AC).
    const std::array<char8_t, 8> windows_bytes{
        static_cast<char8_t>('P'),
        static_cast<char8_t>('r'),
        static_cast<char8_t>('i'),
        static_cast<char8_t>('c'),
        static_cast<char8_t>('e'),
        static_cast<char8_t>(':'),
        static_cast<char8_t>(' '),
        static_cast<char8_t>(0x80u)
    };

    const auto windows_text = utf8_string::from_encoded<encodings::windows_1252>(
        std::u8string_view{ windows_bytes.data(), windows_bytes.size() });
    if (windows_text.base() != u8"Price: \u20AC")
    {
        return 1;
    }

    const auto windows_round_trip = windows_text.to_encoded<encodings::windows_1252>();
    if (!windows_round_trip || *windows_round_trip != std::u8string{ windows_bytes.begin(), windows_bytes.end() })
    {
        return 1;
    }

    const std::array<char8_t, 2> latin1_bytes{
        static_cast<char8_t>('C'),
        static_cast<char8_t>(0xE9u)
    };
    const auto latin1_text = utf8_string::from_encoded<encodings::iso_8859_1>(
        std::u8string_view{ latin1_bytes.data(), latin1_bytes.size() });
    if (latin1_text.base() != u8"C\u00E9")
    {
        return 1;
    }

    const auto latin1_round_trip = latin1_text.to_encoded<encodings::iso_8859_1>();
    if (!latin1_round_trip || *latin1_round_trip != std::u8string{ latin1_bytes.begin(), latin1_bytes.end() })
    {
        return 1;
    }

    // Four bytes that Latin-9 maps differently from Latin-1: U+20AC, U+0152, U+0153, U+0178.
    const std::array<char8_t, 4> latin9_bytes{
        static_cast<char8_t>(0xA4u),
        static_cast<char8_t>(0xBCu),
        static_cast<char8_t>(0xBDu),
        static_cast<char8_t>(0xBEu)
    };
    const auto latin9_text = utf8_string::from_encoded<encodings::iso_8859_15>(
        std::u8string_view{ latin9_bytes.data(), latin9_bytes.size() });
    if (latin9_text.base() != u8"\u20AC\u0152\u0153\u0178")
    {
        return 1;
    }

    const auto latin9_round_trip = latin9_text.to_encoded<encodings::iso_8859_15>();
    if (!latin9_round_trip || *latin9_round_trip != std::u8string{ latin9_bytes.begin(), latin9_bytes.end() })
    {
        return 1;
    }

    // Six Cyrillic letters encoded as Windows-1251 bytes.
    const std::array<char8_t, 6> windows_1251_bytes{
        static_cast<char8_t>(0xCFu),
        static_cast<char8_t>(0xF0u),
        static_cast<char8_t>(0xE8u),
        static_cast<char8_t>(0xE2u),
        static_cast<char8_t>(0xE5u),
        static_cast<char8_t>(0xF2u)
    };
    const auto windows_1251_text = utf8_string::from_encoded<encodings::windows_1251>(
        std::u8string_view{ windows_1251_bytes.data(), windows_1251_bytes.size() });
    if (windows_1251_text.base() != u8"\u041F\u0440\u0438\u0432\u0435\u0442")
    {
        return 1;
    }

    const auto windows_1251_round_trip = windows_1251_text.to_encoded<encodings::windows_1251>();
    if (!windows_1251_round_trip || *windows_1251_round_trip != std::u8string{ windows_1251_bytes.begin(), windows_1251_bytes.end() })
    {
        return 1;
    }

    const auto strict_ascii = utf8_string::from_encoded<encodings::ascii_strict>(u8"Hello");
    if (!strict_ascii || strict_ascii->base() != u8"Hello")
    {
        return 1;
    }

    encodings::ascii_lossy lossy{};
    std::vector<char8_t> lossy_bytes{};
    u8"Caf\u00E9"_utf8_sv.to_utf8_owned().encode_append_to(lossy_bytes, lossy);
    if (lossy.replacement_count != 1 || lossy_bytes != std::vector<char8_t>{
        static_cast<char8_t>('C'),
        static_cast<char8_t>('a'),
        static_cast<char8_t>('f'),
        static_cast<char8_t>('?') })
    {
        return 1;
    }

    return 0;
}

Stateful Codecs¶

A stateful codec is just a non-empty codec object whose state changes across calls.

Typical uses:

replacement counters
warnings and diagnostics
buffered partial sequences
runtime configuration

Because state lives on the object itself, stateful codecs usually should not opt into implicit construction. You normally want to inspect the same object after the operation finishes.

This example shows a small stateful lossy encoder:

#include <cstddef>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

struct counting_ascii_lossy_encoder
{
    using code_unit_type = char8_t;

    static constexpr bool allow_implicit_construction = false;

    std::size_t replacement_count = 0;

    template <typename Writer>
    constexpr void encode_one(char32_t scalar, Writer out)
    {
        if (scalar <= 0x7Fu)
        {
            out.push(static_cast<char8_t>(scalar));
            return;
        }

        out.push(static_cast<char8_t>('?'));
        ++replacement_count;
    }
};

int main()
{
    counting_ascii_lossy_encoder encoder{};
    std::vector<char8_t> bytes{};
    u8"Caf\u00E9"_utf8_sv.to_utf8_owned().encode_append_to(bytes, encoder);

    return encoder.replacement_count == 1
        && bytes == std::vector<char8_t>{
            static_cast<char8_t>('C'),
            static_cast<char8_t>('a'),
            static_cast<char8_t>('f'),
            static_cast<char8_t>('?') }
        ? 0
        : 1;
}

Where To Go Next¶

Boundary Encodings for the exact overload sets and type aliases
Owning Strings for the UTF types these APIs attach to
Common Tasks for higher-level Unicode operations after text is already validated