
Boundary Encodings

unicode_ranges stays UTF-centric internally, but it can now encode to and decode from external non-UTF formats at the boundary.

This is the layer to use when text must cross an interface that speaks something other than validated UTF-8 / UTF-16 / UTF-32:

  • legacy single-byte encodings
  • protocol-specific byte formats
  • application-defined encodings
  • bounded output buffers
  • growable byte containers

The core rule stays the same:

  • validated UTF types are still the semantic center of the library
  • arbitrary encodings live at ingress and egress only

For the exact public API surface, see Boundary Encodings.

What This Layer Gives You

  • built-in codecs for common boundary formats
  • a way to define your own Encoder and Decoder types without specializing library traits
  • generated from_encoded(...), to_encoded(...), encode_to(...), and encode_append_to(...) entry points on the owning UTF string types
  • a uniform error model for codec failures versus bounded-output overflow
  • preserved UTF invariants on the library side

Built-in Codecs

The library currently ships these boundary codecs:

Codec                    | Direction       | Behavior
encodings::ascii_strict  | encode + decode | strict ASCII only; non-ASCII input is an ordinary codec error
encodings::ascii_lossy   | encode + decode | replaces unrepresentable scalars and invalid bytes; tracks replacement count on the codec object
encodings::iso_8859_1    | encode + decode | strict ISO-8859-1 / Latin-1 mapping; encode is fallible, decode is total
encodings::iso_8859_15   | encode + decode | strict ISO-8859-15 / Latin-9 mapping; encode is fallible, decode is total
encodings::windows_1251  | encode + decode | strict WHATWG Windows-1251 mapping; encode is fallible, decode is total
encodings::windows_1252  | encode + decode | strict WHATWG Windows-1252 mapping; encode is fallible, decode is total

Use built-ins when they already match your boundary contract. Define custom codecs when you need different replacement behavior, diagnostics, protocol rules, or a different encoding altogether.

The built-in single-byte codecs use documented source mappings. iso_8859_1 is the direct Latin-1 identity mapping. iso_8859_15, windows_1251, and windows_1252 follow the corresponding WHATWG index files.
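
To illustrate why a Latin-1 style codec is fallible on encode but total on decode, the identity mapping can be sketched in a few lines. The function names below are hypothetical and are not part of unicode_ranges, and std::optional is only a stand-in for the library's error reporting.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: Latin-1 encodes a scalar as the byte with the same
// value, so only U+0000..U+00FF are representable.
constexpr std::optional<std::uint8_t> latin1_encode_sketch(char32_t scalar)
{
    if (scalar > 0xFFu)
    {
        return std::nullopt; // unrepresentable scalar: the fallible direction
    }
    return static_cast<std::uint8_t>(scalar);
}

// Hypothetical helper: every byte value maps to a scalar, so decode cannot fail.
constexpr char32_t latin1_decode_sketch(std::uint8_t byte)
{
    return static_cast<char32_t>(byte);
}

static_assert(latin1_decode_sketch(0xE9u) == U'\u00E9');
static_assert(latin1_encode_sketch(U'\u00E9').value() == 0xE9u);
static_assert(!latin1_encode_sketch(U'\u20AC').has_value());
```

The table-driven codecs differ only in that the mapping is looked up rather than computed.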

Defining Your Own Codecs

Custom codecs are ordinary objects of types you define. The library detects their hooks through encoder_traits and decoder_traits, but you never specialize those traits yourself.

That means:

  • codec state lives on the object itself
  • stateless codecs can stay empty and cheap
  • stateful codecs can keep counters, buffers, runtime configuration, or diagnostics on the object

Encoder Requirements

A custom encoder must define:

  • using code_unit_type = ...;
  • encode_one(char32_t scalar, Writer out)

Optional encoder additions are:

  • using encode_error = ...;
  • static constexpr bool allow_implicit_construction = true;
  • flush(Writer out)
  • encode_from_utf8(...)
  • encode_from_utf16(...)
  • encode_from_utf32(...)

Semantic requirements:

  • code_unit_type must be usable with std::basic_string_view<code_unit_type>
  • if the encoder participates in to_encoded(...), code_unit_type must also be usable with std::basic_string<code_unit_type, std::char_traits<code_unit_type>, ...>
  • encode_one(...) receives one Unicode scalar and writes encoded code units through the writer handle
  • whole-input hooks such as encode_from_utf8(...) must consume the entire input view on success
  • the library still calls flush(...) after a successful whole-input hook
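
The encoder shape above can be sketched without the library. Everything here is an assumption made for illustration: string_writer_sketch is a hypothetical stand-in for the library's writer handle, encode_all_sketch models a plausible call order (encode_one per scalar, then flush), and char is used as code_unit_type to keep the sketch self-contained.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the library's writer handle: a cheap, copyable
// handle that forwards to a sink it does not own.
struct string_writer_sketch
{
    std::string* sink;

    void reserve(std::size_t additional_units) { sink->reserve(sink->size() + additional_units); }
    void push(char unit) { sink->push_back(unit); }
};

// Infallible encoder with a flush hook: lossy ASCII followed by a NUL terminator.
struct nul_terminated_ascii_encoder_sketch
{
    using code_unit_type = char;

    template <typename Writer>
    void encode_one(char32_t scalar, Writer out)
    {
        if (scalar >= 0x01u && scalar <= 0x7Fu)
        {
            out.push(static_cast<char>(scalar));
        }
        else
        {
            out.push('?'); // replace non-ASCII scalars and embedded NUL
        }
    }

    template <typename Writer>
    void flush(Writer out)
    {
        out.push('\0'); // emitted once, after all scalars
    }
};

// Hypothetical driver: one encode_one call per scalar, then a single flush.
template <typename Encoder>
std::string encode_all_sketch(Encoder& encoder, const std::u32string& scalars)
{
    std::string encoded;
    string_writer_sketch out{ &encoded };
    out.reserve(scalars.size() + 1);
    for (const char32_t scalar : scalars)
    {
        encoder.encode_one(scalar, out);
    }
    encoder.flush(out);
    return encoded;
}
```

With this driver, encoding U"Hi\u00E9" yields the four code units 'H', 'i', '?', and NUL.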

Decoder Requirements

A custom decoder must define:

  • using code_unit_type = ...;
  • decode_one(std::basic_string_view<code_unit_type> input, Writer out)

Optional decoder additions are:

  • using decode_error = ...;
  • static constexpr bool allow_implicit_construction = true;
  • flush(Writer out)
  • decode_to_utf8(...)
  • decode_to_utf16(...)
  • decode_to_utf32(...)

Semantic requirements:

  • code_unit_type must be usable with std::basic_string_view<code_unit_type>
  • decode_one(...) receives the remaining suffix of the original encoded input after all previously reported consumption has been removed
  • the success return value is the number of code units consumed from that suffix
  • successful decode_one(...) calls must consume at least one code unit and no more than input.size()
  • once the input is exhausted, the library stops calling decode_one(...) and proceeds to flush(...)
  • flush(...) must be callable even if no prior decode_one(...) call happened, which naturally occurs for empty input
  • whole-input hooks such as decode_to_utf8(...) must consume the entire input view on success
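
The consumption rules are easiest to see with a decoder that consumes a variable number of code units. The sketch below is a model only: percent_decoder_sketch, scalar_writer_sketch, and decode_all_sketch are hypothetical names, the loop is an assumption about how a driver could honor the rules above, and std::optional stands in for std::expected so the sketch does not depend on the library.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical scalar writer handle; copies share the same sink.
struct scalar_writer_sketch
{
    std::u32string* sink;
    void push(char32_t scalar) { sink->push_back(scalar); }
};

// Hypothetical decoder: "%XX" decodes to the scalar U+00XX, any other ASCII
// unit decodes to itself. nullopt models a codec-defined decode error.
struct percent_decoder_sketch
{
    using code_unit_type = char;

    static int hex_value(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    template <typename Writer>
    std::optional<std::size_t> decode_one(std::string_view input, Writer out)
    {
        const unsigned char first = static_cast<unsigned char>(input.front());
        if (first > 0x7Fu) return std::nullopt;
        if (first != '%')
        {
            out.push(static_cast<char32_t>(first));
            return 1; // consumed one code unit
        }
        if (input.size() < 3) return std::nullopt; // truncated escape
        const int hi = hex_value(input[1]);
        const int lo = hex_value(input[2]);
        if (hi < 0 || lo < 0) return std::nullopt;
        out.push(static_cast<char32_t>(hi * 16 + lo));
        return 3; // consumed the whole escape
    }
};

// Model of the loop: pass the remaining suffix, then drop what was consumed.
template <typename Decoder>
std::optional<std::u32string> decode_all_sketch(Decoder& decoder, std::string_view input)
{
    std::u32string scalars;
    scalar_writer_sketch out{ &scalars };
    while (!input.empty())
    {
        const std::optional<std::size_t> consumed = decoder.decode_one(input, out);
        if (!consumed) return std::nullopt;
        input.remove_prefix(*consumed); // contract: 1 <= *consumed <= input.size()
    }
    // A decoder that defines flush(...) would have it called here, even for empty input.
    return scalars;
}
```

Because decode_one only ever sees a non-empty suffix, calling input.front() inside the hook is safe under this model.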

Fallible Versus Infallible Codecs

Whether a codec is fallible is declared by the presence of the error alias:

  • define encode_error to make the encoder fallible
  • define decode_error to make the decoder fallible

Once the alias exists, the matching hooks must switch to std::expected return types:

  • encode_one(...)
  • decode_one(...)
  • flush(...)
  • any whole-input bulk hook you provide

Additional rules:

  • the error type of each std::expected must be the alias itself: std::expected<..., encode_error> on the encoder side, std::expected<..., decode_error> on the decoder side
  • encode_error and decode_error must be non-void types; using encode_error = void; and using decode_error = void; are not valid, so omit the alias entirely for infallible codecs

If the alias is absent, those hooks use the direct success form instead.

Example:

#include <expected>

// Infallible encoder: no encode_error alias, plain success returns.
struct ascii_encoder {
    using code_unit_type = char8_t;

    template <typename Writer>
    void encode_one(char32_t scalar, Writer out);

    template <typename Writer>
    void flush(Writer out);
};

// Fallible encoder: encode_error exists, matching hooks return expected.
struct strict_legacy_encoder {
    using code_unit_type = char8_t;

    enum class encode_error {
        unrepresentable_scalar
    };

    template <typename Writer>
    std::expected<void, encode_error> encode_one(char32_t scalar, Writer out);

    template <typename Writer>
    std::expected<void, encode_error> flush(Writer out);
};
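
Alias-based detection of this kind can be expressed with a small type trait. The trait below is a hypothetical illustration written with std::void_t; it is not the library's encoder_traits, whose implementation is not shown in this guide.

```cpp
#include <type_traits>

// Hypothetical detection trait: true when the encoder declares a nested
// encode_error type, false otherwise.
template <typename Encoder, typename = void>
struct declares_encode_error : std::false_type
{
};

template <typename Encoder>
struct declares_encode_error<Encoder, std::void_t<typename Encoder::encode_error>> : std::true_type
{
};

struct infallible_sketch
{
    using code_unit_type = char;
};

struct fallible_sketch
{
    using code_unit_type = char;

    enum class encode_error
    {
        unrepresentable_scalar
    };
};

static_assert(!declares_encode_error<infallible_sketch>::value);
static_assert(declares_encode_error<fallible_sketch>::value);
```

A trait like this only observes that the alias exists, which is why the presence of the alias alone decides which return form the hooks must use.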

Implicit Construction

The convenience overloads that do not take an explicit codec object are controlled by allow_implicit_construction.

Rules:

  • if allow_implicit_construction is omitted, an empty default-constructible codec may still be treated as implicitly constructible
  • explicit false opts out even for empty default-constructible codecs
  • explicit true opts in even for non-empty codecs
  • if allow_implicit_construction is true but the codec is not default-constructible, the convenience overloads fail with a static assertion because the library must default-construct a temporary codec object internally

This matters because the generated no-object APIs internally create a temporary codec object and do not return it to you afterwards.

That is appropriate for stateless codecs. It is usually the wrong choice for stateful codecs whose counters or diagnostics matter after the operation.
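
The first three rules can be modeled as a compile-time predicate. This is a sketch under assumptions: implicit_flag_sketch is a hypothetical name, the library's real trait may be structured differently, and the static assertion for non-default-constructible codecs is not modeled.

```cpp
#include <type_traits>

// Flag omitted: fall back to "empty and default-constructible".
template <typename Codec, typename = void>
struct implicit_flag_sketch
{
    static constexpr bool value =
        std::is_empty<Codec>::value && std::is_default_constructible<Codec>::value;
};

// Flag present: the explicit choice wins in both directions.
template <typename Codec>
struct implicit_flag_sketch<Codec, std::void_t<decltype(Codec::allow_implicit_construction)>>
{
    static constexpr bool value = Codec::allow_implicit_construction;
};

struct empty_default_sketch
{
};

struct empty_opt_out_sketch
{
    static constexpr bool allow_implicit_construction = false;
};

struct stateful_default_sketch
{
    int counter = 0;
};

struct stateful_opt_in_sketch
{
    static constexpr bool allow_implicit_construction = true;
    int counter = 0;
};

static_assert(implicit_flag_sketch<empty_default_sketch>::value);
static_assert(!implicit_flag_sketch<empty_opt_out_sketch>::value);
static_assert(!implicit_flag_sketch<stateful_default_sketch>::value);
static_assert(implicit_flag_sketch<stateful_opt_in_sketch>::value);
```

The last case is the one to be careful with: the counter on a temporary codec is discarded when the convenience overload returns.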

The Writer Contract

Codec hooks do not write directly to arbitrary iterators or containers. They receive a small writer handle instead.

Important properties:

  • writers are passed by value
  • copying a writer copies only the handle, not the destination
  • writer copies still talk to the same underlying sink
  • writers are call-scoped and should not be retained by codec objects

Writers provide three operations:

  • reserve(additional_units)
  • push(unit)
  • append(units)

The same writer model is used for:

  • encoded code units on the encode side
  • Unicode scalar output on the primitive decode side
  • UTF code-unit output inside the library's bulk decode paths
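
The handle semantics can be pictured with a minimal writer over a std::vector. This is a hypothetical stand-in, not the library's writer type: the class name and the exact signatures of reserve, push, and append are assumptions chosen to mirror the three operations listed above.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical writer handle over a growable sink.
class char_writer_sketch
{
public:
    explicit char_writer_sketch(std::vector<char>& sink) : sink_(&sink) {}

    void reserve(std::size_t additional_units) { sink_->reserve(sink_->size() + additional_units); }
    void push(char unit) { sink_->push_back(unit); }
    void append(std::string_view units) { sink_->insert(sink_->end(), units.begin(), units.end()); }

private:
    std::vector<char>* sink_; // non-owning: copying the handle copies only this pointer
};

// Takes the writer by value, as codec hooks do.
inline void write_greeting_sketch(char_writer_sketch out)
{
    out.reserve(3);
    out.append("hi");
    char_writer_sketch copy = out; // a second handle to the same sink
    copy.push('!');
}
```

Calling write_greeting_sketch on a sink that already holds 'a' leaves it holding 'a', 'h', 'i', '!', because both handles write to the same destination.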

Generated Owning-String APIs

After you define a decoder, the owning UTF string types gain these entry points:

  • utf8_string::from_encoded<Decoder>(...)
  • utf16_string::from_encoded<Decoder>(...)
  • utf32_string::from_encoded<Decoder>(...)

After you define an encoder, the owning UTF string types gain these entry points:

  • text.to_encoded<Encoder>(...)
  • text.encode_to<Encoder>(...)
  • text.encode_append_to<Encoder>(...)

In practice that means:

  • from_encoded(...) always materializes an owned, validated UTF string
  • to_encoded(...) builds an owned encoded string
  • encode_to(...) targets a bounded raw output range and reports overflow explicitly
  • encode_append_to(...) appends to an existing growable sequence-like container and never reports overflow

This example shows a custom strict ASCII encoder/decoder pair and the generated methods it unlocks:

#include <array>
#include <cstddef>
#include <cstdint>
#include <expected>
#include <span>
#include <string_view>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;

struct strict_ascii_encoder
{
    using code_unit_type = char8_t;

    enum class encode_error
    {
        unrepresentable_scalar
    };

    static constexpr bool allow_implicit_construction = true;

    template <typename Writer>
    constexpr auto encode_one(char32_t scalar, Writer out) -> std::expected<void, encode_error>
    {
        if (scalar > 0x7Fu)
        {
            return std::unexpected(encode_error::unrepresentable_scalar);
        }

        out.push(static_cast<char8_t>(scalar));
        return {};
    }
};

struct strict_ascii_decoder
{
    using code_unit_type = char8_t;

    enum class decode_error
    {
        invalid_input
    };

    static constexpr bool allow_implicit_construction = true;

    template <typename Writer>
    constexpr auto decode_one(std::basic_string_view<char8_t> input, Writer out)
        -> std::expected<std::size_t, decode_error>
    {
        const auto byte = static_cast<std::uint8_t>(input.front());
        if (byte > 0x7Fu)
        {
            return std::unexpected(decode_error::invalid_input);
        }

        out.push(static_cast<char32_t>(byte));
        return 1;
    }
};

int main()
{
    auto decoded = utf8_string::from_encoded<strict_ascii_decoder>(u8"Hello");
    if (!decoded || decoded->base() != u8"Hello")
    {
        return 1;
    }

    auto owned_bytes = decoded->to_encoded<strict_ascii_encoder>();
    if (!owned_bytes || *owned_bytes != u8"Hello")
    {
        return 1;
    }

    std::array<char8_t, 5> bounded{};
    auto bounded_result = decoded->encode_to<strict_ascii_encoder>(std::span<char8_t>{ bounded });
    if (!bounded_result || std::u8string_view{ bounded.data(), bounded.size() } != u8"Hello")
    {
        return 1;
    }

    std::vector<char8_t> appended{ static_cast<char8_t>('>') };
    auto append_result = decoded->encode_append_to<strict_ascii_encoder>(appended);
    if (!append_result || appended != std::vector<char8_t>{
        static_cast<char8_t>('>'),
        static_cast<char8_t>('H'),
        static_cast<char8_t>('e'),
        static_cast<char8_t>('l'),
        static_cast<char8_t>('l'),
        static_cast<char8_t>('o') })
    {
        return 1;
    }

    return 0;
}

Error Handling And Guarantees

Boundary encoding has two separate failure classes:

  • codec-defined failures
  • bounded-output overflow on the raw encode_to(...) path

Codec Failures

Codec failures belong to the codec object itself:

  • strict rejection of unrepresentable scalars
  • invalid input sequences
  • protocol-specific decode failures

These surface through encode_error or decode_error.

Return-type rules:

  • from_encoded(...) returns the UTF value directly for infallible decoders, or std::expected<UTF, decode_error> for fallible decoders
  • to_encoded(...) returns the encoded string directly for infallible encoders, or std::expected<string, encode_error> for fallible encoders
  • encode_append_to(...) returns void for infallible encoders, or std::expected<void, encode_error> for fallible encoders

Bounded Output Overflow

Overflow is not a codec error.

It belongs to the library-owned bounded writer used by encode_to(...).

That path always returns std::expected<void, encode_to_error<Encoder>>, whose error carries one of these kinds:

  • encode_to_error_kind::overflow for destination exhaustion
  • encode_to_error_kind::encoding_error for codec-defined encode failures

Partial Output Preservation

Output that has already been written is preserved in each of these cases:

  • bounded overflow
  • codec failure after a prefix was written
  • a growable container throwing partway through; units appended before the exception escapes remain in the container

The library does not roll already-written output back.
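
The two failure classes and the preserved prefix can be modeled with a small bounded writer. All names here are hypothetical: sketch_error_kind loosely mirrors the two kinds described above, and the real encode_to(...) returns std::expected rather than a plain enum.

```cpp
#include <cstddef>
#include <string>

enum class sketch_error_kind
{
    none,
    overflow,
    encoding_error
};

// Hypothetical bounded sink over caller-provided storage.
struct bounded_sink_sketch
{
    char* data;
    std::size_t capacity;
    std::size_t written = 0;
    bool overflowed = false;
};

// Hypothetical writer handle that records overflow instead of growing.
struct bounded_writer_sketch
{
    bounded_sink_sketch* sink;

    void push(char unit)
    {
        if (sink->written == sink->capacity)
        {
            sink->overflowed = true;
            return;
        }
        sink->data[sink->written++] = unit;
    }
};

// Strict ASCII encode into a bounded buffer. Nothing is rolled back on failure.
inline sketch_error_kind encode_ascii_bounded_sketch(const std::u32string& scalars, bounded_sink_sketch& sink)
{
    bounded_writer_sketch out{ &sink };
    for (const char32_t scalar : scalars)
    {
        if (scalar > 0x7Fu)
        {
            return sketch_error_kind::encoding_error; // codec-defined failure
        }
        out.push(static_cast<char>(scalar));
        if (sink.overflowed)
        {
            return sketch_error_kind::overflow; // destination exhausted
        }
    }
    return sketch_error_kind::none;
}
```

Encoding U"Hello" into a three-unit buffer reports overflow and leaves "Hel" in place; encoding U"Caf\u00E9" into a larger buffer reports an encoding error after "Caf" has been written.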

Decoder Validation Guarantee

Custom decoders are allowed to write through bulk UTF paths such as decode_to_utf8(...), but the library still validates the final UTF result before constructing the public owning UTF type.

That keeps the invariant intact:

  • successful from_encoded(...) still returns validated UTF

Contract Violations

The codec hooks are a contract.

Examples of contract violations:

  • reporting success from decode_one(...) with 0 consumed input
  • emitting an invalid Unicode scalar
  • writing malformed UTF through a bulk UTF hook
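
The first two violations correspond to simple predicates. The helpers below are hypothetical and only illustrate the conditions; how the library performs its own checks is not shown in this guide.

```cpp
#include <cstddef>

// A Unicode scalar value is any code point up to U+10FFFF outside the
// surrogate range U+D800..U+DFFF.
constexpr bool is_unicode_scalar_sketch(char32_t value)
{
    return value <= 0x10FFFFu && !(value >= 0xD800u && value <= 0xDFFFu);
}

// A successful decode_one(...) must consume at least one code unit and no
// more than the remaining input.
constexpr bool is_valid_consumption_sketch(std::size_t consumed, std::size_t remaining)
{
    return consumed >= 1 && consumed <= remaining;
}

static_assert(is_unicode_scalar_sketch(U'\u20AC'));
static_assert(!is_unicode_scalar_sketch(static_cast<char32_t>(0xD800u)));
static_assert(!is_unicode_scalar_sketch(static_cast<char32_t>(0x110000u)));
static_assert(is_valid_consumption_sketch(3, 3));
static_assert(!is_valid_consumption_sketch(0, 3));
static_assert(!is_valid_consumption_sketch(4, 3));
```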

When UTF8_RANGES_ENABLE_CODEC_CONTRACT_CHECKS is enabled:

  • exception-enabled builds throw codec_contract_violation
  • no-exception builds terminate

When contract checks are disabled, violating the codec contract is undefined behavior.

Native Codecs In Use

The built-ins are meant to demonstrate the supported codec shapes as well as provide immediate value.

This example uses all currently supported native codecs:

#include <array>
#include <string>
#include <string_view>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    const std::array<char8_t, 8> windows_bytes{
        static_cast<char8_t>('P'),
        static_cast<char8_t>('r'),
        static_cast<char8_t>('i'),
        static_cast<char8_t>('c'),
        static_cast<char8_t>('e'),
        static_cast<char8_t>(':'),
        static_cast<char8_t>(' '),
        static_cast<char8_t>(0x80u)
    };

    const auto windows_text = utf8_string::from_encoded<encodings::windows_1252>(
        std::u8string_view{ windows_bytes.data(), windows_bytes.size() });
    if (windows_text.base() != u8"Price: \u20AC")
    {
        return 1;
    }

    const auto windows_round_trip = windows_text.to_encoded<encodings::windows_1252>();
    if (!windows_round_trip || *windows_round_trip != std::u8string{ windows_bytes.begin(), windows_bytes.end() })
    {
        return 1;
    }

    const std::array<char8_t, 2> latin1_bytes{
        static_cast<char8_t>('C'),
        static_cast<char8_t>(0xE9u)
    };
    const auto latin1_text = utf8_string::from_encoded<encodings::iso_8859_1>(
        std::u8string_view{ latin1_bytes.data(), latin1_bytes.size() });
    if (latin1_text.base() != u8"C\u00E9")
    {
        return 1;
    }

    const auto latin1_round_trip = latin1_text.to_encoded<encodings::iso_8859_1>();
    if (!latin1_round_trip || *latin1_round_trip != std::u8string{ latin1_bytes.begin(), latin1_bytes.end() })
    {
        return 1;
    }

    const std::array<char8_t, 4> latin9_bytes{
        static_cast<char8_t>(0xA4u),
        static_cast<char8_t>(0xBCu),
        static_cast<char8_t>(0xBDu),
        static_cast<char8_t>(0xBEu)
    };
    const auto latin9_text = utf8_string::from_encoded<encodings::iso_8859_15>(
        std::u8string_view{ latin9_bytes.data(), latin9_bytes.size() });
    if (latin9_text.base() != u8"\u20AC\u0152\u0153\u0178")
    {
        return 1;
    }

    const auto latin9_round_trip = latin9_text.to_encoded<encodings::iso_8859_15>();
    if (!latin9_round_trip || *latin9_round_trip != std::u8string{ latin9_bytes.begin(), latin9_bytes.end() })
    {
        return 1;
    }

    const std::array<char8_t, 6> windows_1251_bytes{
        static_cast<char8_t>(0xCFu),
        static_cast<char8_t>(0xF0u),
        static_cast<char8_t>(0xE8u),
        static_cast<char8_t>(0xE2u),
        static_cast<char8_t>(0xE5u),
        static_cast<char8_t>(0xF2u)
    };
    const auto windows_1251_text = utf8_string::from_encoded<encodings::windows_1251>(
        std::u8string_view{ windows_1251_bytes.data(), windows_1251_bytes.size() });
    if (windows_1251_text.base() != u8"\u041F\u0440\u0438\u0432\u0435\u0442")
    {
        return 1;
    }

    const auto windows_1251_round_trip = windows_1251_text.to_encoded<encodings::windows_1251>();
    if (!windows_1251_round_trip || *windows_1251_round_trip != std::u8string{ windows_1251_bytes.begin(), windows_1251_bytes.end() })
    {
        return 1;
    }

    const auto strict_ascii = utf8_string::from_encoded<encodings::ascii_strict>(u8"Hello");
    if (!strict_ascii || strict_ascii->base() != u8"Hello")
    {
        return 1;
    }

    encodings::ascii_lossy lossy{};
    std::vector<char8_t> lossy_bytes{};
    u8"Caf\u00E9"_utf8_sv.to_utf8_owned().encode_append_to(lossy_bytes, lossy);
    if (lossy.replacement_count != 1 || lossy_bytes != std::vector<char8_t>{
        static_cast<char8_t>('C'),
        static_cast<char8_t>('a'),
        static_cast<char8_t>('f'),
        static_cast<char8_t>('?') })
    {
        return 1;
    }

    return 0;
}

Stateful Codecs

A stateful codec is just a non-empty codec object whose state changes across calls.

Typical uses:

  • replacement counters
  • warnings and diagnostics
  • buffered partial sequences
  • runtime configuration

Because state lives on the object itself, stateful codecs usually should not opt into implicit construction. You normally want to inspect the same object after the operation finishes.

This example shows a small stateful lossy encoder:

#include <cstddef>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

struct counting_ascii_lossy_encoder
{
    using code_unit_type = char8_t;

    static constexpr bool allow_implicit_construction = false;

    std::size_t replacement_count = 0;

    template <typename Writer>
    constexpr void encode_one(char32_t scalar, Writer out)
    {
        if (scalar <= 0x7Fu)
        {
            out.push(static_cast<char8_t>(scalar));
            return;
        }

        out.push(static_cast<char8_t>('?'));
        ++replacement_count;
    }
};

int main()
{
    counting_ascii_lossy_encoder encoder{};
    std::vector<char8_t> bytes{};
    u8"Caf\u00E9"_utf8_sv.to_utf8_owned().encode_append_to(bytes, encoder);

    return encoder.replacement_count == 1
        && bytes == std::vector<char8_t>{
            static_cast<char8_t>('C'),
            static_cast<char8_t>('a'),
            static_cast<char8_t>('f'),
            static_cast<char8_t>('?') }
        ? 0
        : 1;
}

Where To Go Next