Boundary Encodings¶
The boundary-encoding API extends the validated UTF string types at external encode and decode boundaries without changing the library's core model.
This page documents the exact public surface. For the higher-level guide to built-in codecs, custom codec requirements, guarantees, and error handling, see Boundary Encodings.
#include <array>
#include <cassert>
#include <span>
#include <string>
#include <string_view>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    // Strict ASCII is fallible in both directions, so each result is checked.
    auto decoded = utf8_string::from_encoded<encodings::ascii_strict>(u8"Hello");
    assert(decoded);
    auto strict_bytes = decoded->to_encoded<encodings::ascii_strict>();
    assert(strict_bytes);
    assert(*strict_bytes == u8"Hello");

    // Bounded raw sink: the span is exactly large enough for the output.
    std::array<char8_t, 5> bounded{};
    encodings::ascii_strict strict{};
    auto wrote_bounded = decoded->encode_to(std::span<char8_t>{ bounded }, strict);
    assert(wrote_bounded);
    const std::u8string_view bounded_view{ bounded.data(), bounded.size() };
    assert(bounded_view == u8"Hello");

    // Windows-1252 decoding is infallible, so from_encoded returns the string directly.
    const std::array<char8_t, 8> windows_input{
        static_cast<char8_t>('P'),
        static_cast<char8_t>('r'),
        static_cast<char8_t>('i'),
        static_cast<char8_t>('c'),
        static_cast<char8_t>('e'),
        static_cast<char8_t>(':'),
        static_cast<char8_t>(' '),
        static_cast<char8_t>(0x80u)
    };
    const auto windows = utf8_string::from_encoded<encodings::windows_1252>(
        std::u8string_view{ windows_input.data(), windows_input.size() });
    assert(windows.base() == u8"Price: \u20AC");
    auto windows_encoded = windows.to_encoded<encodings::windows_1252>();
    assert(windows_encoded);
    const std::u8string expected_windows_bytes{ windows_input.begin(), windows_input.end() };
    assert(*windows_encoded == expected_windows_bytes);

    // Lossy ASCII appends after the existing contents and counts replacements on the codec object.
    std::vector<char8_t> lossy_bytes{ static_cast<char8_t>('>') };
    encodings::ascii_lossy lossy{};
    u8"Café"_utf8_sv.to_utf8_owned().encode_append_to(lossy_bytes, lossy);
    assert((lossy_bytes == std::vector<char8_t>{
        static_cast<char8_t>('>'),
        static_cast<char8_t>('C'),
        static_cast<char8_t>('a'),
        static_cast<char8_t>('f'),
        static_cast<char8_t>('?') }));
    assert(lossy.replacement_count == 1);
    return 0;
}
Header And Namespaces¶
- Include unicode_ranges_all.hpp for the full surface.
- Include unicode_ranges_borrowed.hpp for the lighter borrowed/core umbrella.
- The boundary API lives in namespace unicode_ranges.
- Built-in codecs currently live in namespace unicode_ranges::encodings.
Core Types¶
Synopsis¶
template <typename Encoder>
struct encoder_traits;
template <typename Decoder>
struct decoder_traits;
template <typename T>
concept encoder = /* exposition only */;
template <typename T>
concept decoder = /* exposition only */;
template <typename Decoder, typename UtfString>
using from_encoded_result = /* UtfString or std::expected<UtfString, decode_error> */;
template <typename Encoder, typename OutputAllocator = std::allocator<typename encoder_traits<Encoder>::code_unit_type>>
using to_encoded_result = /* string or std::expected<string, encode_error> */;
enum class encode_to_error_kind {
    overflow,
    encoding_error
};
template <typename Encoder>
struct encode_to_error;
struct codec_contract_violation : std::logic_error {};
Behavior¶
- encoder_traits and decoder_traits normalize codec objects into the surface the library actually calls.
- The traits layer always provides flush(...) even when the codec object does not define it.
- Decoder code_unit_type must be usable with std::basic_string_view<code_unit_type>.
- to_encoded(...) additionally requires an encoded code_unit_type that can back std::basic_string<code_unit_type, std::char_traits<code_unit_type>, ...>.
- from_encoded_result and to_encoded_result follow the optional-error-alias rule:
  - no decode_error / encode_error alias means a direct UTF value or direct encoded string
  - defining the alias switches the corresponding family to std::expected
- encode_to_error_kind::overflow is library-owned bounded-sink exhaustion
- encode_to_error_kind::encoding_error wraps the codec's encode_error
- codec_contract_violation is reserved for codec bugs when contract checks are enabled
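The optional-error-alias rule can be illustrated with a small detection trait. This is only a sketch of the idea, not the library's implementation: has_encode_error, encoded_result, expected_like, and both sample codecs are hypothetical names, and expected_like stands in for std::expected so the sketch also builds before C++23.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <variant>

// Detects whether a codec type declares a nested encode_error alias.
template <typename Codec, typename = void>
struct has_encode_error : std::false_type {};

template <typename Codec>
struct has_encode_error<Codec, std::void_t<typename Codec::encode_error>> : std::true_type {};

// Hypothetical stand-in for std::expected<T, E>.
template <typename T, typename E>
struct expected_like {
    std::variant<T, E> state;
};

// No alias: the result is the encoded string itself.
template <typename Codec, bool = has_encode_error<Codec>::value>
struct encoded_result {
    using type = std::string;
};

// Alias present: the result switches to the expected-style wrapper.
template <typename Codec>
struct encoded_result<Codec, true> {
    using type = expected_like<std::string, typename Codec::encode_error>;
};

struct infallible_codec {};

struct fallible_codec {
    enum class encode_error { unrepresentable_scalar };
};

static_assert(std::is_same_v<encoded_result<infallible_codec>::type, std::string>);
static_assert(!std::is_same_v<encoded_result<fallible_codec>::type, std::string>);
```

The same pattern would apply to decode_error and from_encoded_result, with the validated UTF string type in place of std::string.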
Contract checks¶
UTF8_RANGES_ENABLE_CODEC_CONTRACT_CHECKS defaults to:
- 1 in debug builds
- 0 in release builds
When enabled, contract violations throw codec_contract_violation in exception-enabled builds and terminate in no-exception builds. When disabled, violating the codec contract is undefined behavior.
Codec Objects¶
Minimum shape¶
struct my_encoder {
    using code_unit_type = char8_t;

    template <typename Writer>
    void encode_one(char32_t scalar, Writer out);
};

struct my_decoder {
    using code_unit_type = char8_t;

    template <typename Writer>
    std::size_t decode_one(std::basic_string_view<char8_t> input, Writer out);
};
Optional additions¶
using encode_error = /* ... */;
using decode_error = /* ... */;
static constexpr bool allow_implicit_construction = true;
template <typename Writer>
void flush(Writer out);
template <typename Writer>
void encode_from_utf8(utf8_string_view input, Writer out);
template <typename Writer>
void encode_from_utf16(utf16_string_view input, Writer out);
template <typename Writer>
void encode_from_utf32(utf32_string_view input, Writer out);
template <typename Writer>
void decode_to_utf8(std::basic_string_view<char8_t> input, Writer out);
template <typename Writer>
void decode_to_utf16(std::basic_string_view<char8_t> input, Writer out);
template <typename Writer>
void decode_to_utf32(std::basic_string_view<char8_t> input, Writer out);
Behavior¶
- Codec objects are real mutable objects. Any runtime state lives on the object itself.
- Writer parameters are taken by value. The writer is a cheap non-owning handle over external sink state.
- Decoder code_unit_type therefore has to be a valid std::basic_string_view element type.
- Encoders intended for to_encoded(...) must use a code_unit_type that is also valid for std::basic_string with std::char_traits<code_unit_type>.
- If encode_error or decode_error is defined, that alias must be a non-void type.
- If encode_error is defined, encode_one(...), flush(...), and any encode_from_utf* hook the codec provides must return std::expected<..., encode_error> instead of the infallible form.
- If decode_error is defined, decode_one(...), flush(...), and any decode_to_utf* hook the codec provides must return std::expected<..., decode_error> instead of the infallible form.
- using encode_error = void; and using decode_error = void; are not valid; omit the alias entirely for infallible codecs.
- allow_implicit_construction is optional.
  - if omitted, empty default-constructible codecs are treated as implicitly constructible
  - explicit false opts out
  - explicit true opts in even for non-empty codecs
- If allow_implicit_construction is true but the codec is not default-constructible, the no-object convenience overloads fail with a static assertion because the library must default-construct a temporary codec internally
Example:
// Infallible encoder: no encode_error alias, hooks return plain success values.
struct ascii_encoder {
    using code_unit_type = char8_t;

    template <typename Writer>
    void encode_one(char32_t scalar, Writer out);

    template <typename Writer>
    void flush(Writer out);
};

// Fallible encoder: encode_error exists and the matching hooks switch to expected.
struct strict_legacy_encoder {
    using code_unit_type = char8_t;

    enum class encode_error {
        unrepresentable_scalar
    };

    template <typename Writer>
    std::expected<void, encode_error> encode_one(char32_t scalar, Writer out);

    template <typename Writer>
    std::expected<void, encode_error> flush(Writer out);

    template <typename Writer>
    std::expected<void, encode_error> encode_from_utf8(utf8_string_view input, Writer out);
};
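The allow_implicit_construction defaulting rule described above can be sketched as a trait. The trait name implicit_construction_requested and the four sample codecs are hypothetical; the library's own traits layer may compute this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Default rule: with no flag, a codec is implicitly constructible only if it
// is empty and default-constructible.
template <typename Codec, typename = void>
struct implicit_construction_requested
    : std::bool_constant<std::is_empty_v<Codec> && std::is_default_constructible_v<Codec>> {};

// An explicit allow_implicit_construction member overrides the default.
template <typename Codec>
struct implicit_construction_requested<
    Codec, std::void_t<decltype(Codec::allow_implicit_construction)>>
    : std::bool_constant<Codec::allow_implicit_construction> {};

// Empty and default-constructible: implicit by default.
struct stateless_codec {};

// Carries runtime state: not implicit unless it opts in.
struct counting_codec {
    std::size_t replacement_count = 0;
};

// Empty, but explicitly opts out.
struct opted_out_codec {
    static constexpr bool allow_implicit_construction = false;
};

// Non-empty, but explicitly opts in.
struct opted_in_codec {
    static constexpr bool allow_implicit_construction = true;
    std::size_t replacement_count = 0;
};
```

Under these assumptions, stateless_codec and opted_in_codec would enable the no-object convenience overloads, while counting_codec and opted_out_codec would require the caller to pass a codec object.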
Whole-input contract¶
- encode_from_utf8(...), encode_from_utf16(...), encode_from_utf32(...), decode_to_utf8(...), decode_to_utf16(...), and decode_to_utf32(...) are whole-input operations
- on success they must consume the full input view they are given
- they cannot silently stop early
- the surrounding library algorithm still calls flush(...) afterwards
Primitive decode contract¶
- decode_one(...) receives the remaining suffix of the original input after previous successful consumption
- the returned consumed count is relative to that suffix
- on success, consumed count must be non-zero and must not exceed input.size()
- once the input is exhausted, the library skips further decode_one(...) calls and proceeds to flush(...)
- flush(...) must also be valid when no prior decode_one(...) call occurred, which naturally happens for empty input
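The primitive decode contract can be pictured as a driver loop. The sketch below is an assumption about how such a loop might look, not the library's code: u32_writer, contract_violation, latin1_decoder, stuck_decoder, and decode_all are hypothetical, and std::string_view over char stands in for std::basic_string_view<char8_t> so the example builds as C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Stand-in for the library's writer handle: a cheap copyable view over a sink.
struct u32_writer {
    std::u32string* sink;
    void push(char32_t scalar) const { sink->push_back(scalar); }
};

// Stand-in for codec_contract_violation.
struct contract_violation : std::logic_error {
    using std::logic_error::logic_error;
};

// Minimal Latin-1 style decoder: one byte in, one scalar out.
struct latin1_decoder {
    std::size_t decode_one(std::string_view input, u32_writer out) {
        out.push(static_cast<char32_t>(static_cast<unsigned char>(input.front())));
        return 1;  // consumed count, relative to the suffix it was given
    }
    void flush(u32_writer) {}
};

// Deliberately broken decoder that reports zero progress.
struct stuck_decoder {
    std::size_t decode_one(std::string_view, u32_writer) { return 0; }
    void flush(u32_writer) {}
};

template <typename Decoder>
std::u32string decode_all(std::string_view input, Decoder& decoder) {
    std::u32string result;
    const u32_writer out{ &result };
    while (!input.empty()) {
        const std::size_t consumed = decoder.decode_one(input, out);
        // Contract check: progress must be non-zero and within the suffix.
        if (consumed == 0 || consumed > input.size()) {
            throw contract_violation("decode_one reported an invalid consumed count");
        }
        input.remove_prefix(consumed);  // the next call sees only the remaining suffix
    }
    decoder.flush(out);  // reached even when the input was empty
    return result;
}
```

With empty input the loop body never runs, which is why flush(...) has to tolerate being the first and only call.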
Writer Surface¶
Codecs do not write directly to arbitrary containers or iterators. They receive a library writer handle with this logical contract:
struct Writer {
    using unit_type = /* code unit or char32_t scalar, depending on context */;

    void reserve(std::size_t additional_units) const;
    void push(unit_type unit) const;
    void append(std::span<const unit_type> units) const;

    template <std::ranges::input_range R>
        requires std::convertible_to<std::ranges::range_reference_t<R>, unit_type>
    void append(R&& units) const;
};
Behavior¶
- Writer copies share the same underlying destination state.
- Writers are call-scoped handles and should not be retained by codecs.
- Raw bounded writers report overflow through encode_to(...), not by throwing.
- Growable container writers propagate ordinary container exceptions.
- For container appenders, the implementation prefers:
  - resize_and_overwrite(...) for suitable span or sized-range appends
  - append_range(...)
  - append(ptr, count) for string-like containers
  - insert_range(end(), ...)
  - insert(end(), first, last)
  - repeated push_back / emplace_back / insert(end(), value) with reserve(...) only on that repeated-push fallback
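To make the handle semantics concrete, here is a minimal sketch of a writer over a std::vector. vector_writer and write_greeting are hypothetical names, and the pointer-plus-count append stands in for the std::span overload so the sketch builds as C++17; the real writer type is supplied by the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Non-owning handle: copies point at the same destination vector.
template <typename Unit>
class vector_writer {
public:
    using unit_type = Unit;

    explicit vector_writer(std::vector<Unit>& sink) : sink_(&sink) {}

    void reserve(std::size_t additional_units) const {
        sink_->reserve(sink_->size() + additional_units);
    }

    void push(unit_type unit) const { sink_->push_back(unit); }

    // Stand-in for append(std::span<const unit_type>).
    void append(const unit_type* units, std::size_t count) const {
        sink_->insert(sink_->end(), units, units + count);
    }

private:
    std::vector<Unit>* sink_;
};

// Takes the writer by value, as a codec hook would.
inline void write_greeting(vector_writer<char> out) {
    static constexpr char text[] = { 'h', 'i' };
    out.reserve(2);
    out.append(text, 2);
}
```

Because the handle only stores a pointer, passing it by value is cheap, and output written through any copy lands in the same destination.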
Owning String Boundary Functions¶
The UTF-8, UTF-16, and UTF-32 owning string types expose structurally parallel boundary APIs. The synopsis below uses the UTF-8 family explicitly.
Decode into validated UTF¶
template <typename Decoder>
static constexpr auto from_encoded(
    std::basic_string_view<typename decoder_traits<Decoder>::code_unit_type> input,
    Decoder& decoder,
    const Allocator& alloc = Allocator())
    -> from_encoded_result<Decoder, basic_utf8_string>;

template <typename Decoder>
    requires decoder_traits<Decoder>::allow_implicit_construction_requested
static constexpr auto from_encoded(
    std::basic_string_view<typename decoder_traits<Decoder>::code_unit_type> input,
    const Allocator& alloc = Allocator())
    -> from_encoded_result<Decoder, basic_utf8_string>;
Encode into an owned encoded string¶
template <typename Encoder, typename OutputAllocator = std::allocator<typename encoder_traits<Encoder>::code_unit_type>>
constexpr auto to_encoded(
    Encoder& encoder,
    const OutputAllocator& alloc = OutputAllocator()) const
    -> to_encoded_result<Encoder, OutputAllocator>;

template <typename Encoder, typename OutputAllocator = std::allocator<typename encoder_traits<Encoder>::code_unit_type>>
    requires encoder_traits<Encoder>::allow_implicit_construction_requested
constexpr auto to_encoded(
    const OutputAllocator& alloc = OutputAllocator()) const
    -> to_encoded_result<Encoder, OutputAllocator>;
Encode into a bounded raw sink¶
template <typename Encoder, typename Out>
    requires std::ranges::range<Out>
        && std::ranges::output_range<Out, typename encoder_traits<Encoder>::code_unit_type>
constexpr auto encode_to(Out&& out, Encoder& encoder) const
    -> std::expected<void, encode_to_error<Encoder>>;

template <typename Encoder, typename Out>
    requires encoder_traits<Encoder>::allow_implicit_construction_requested
        && std::ranges::range<Out>
        && std::ranges::output_range<Out, typename encoder_traits<Encoder>::code_unit_type>
constexpr auto encode_to(Out&& out) const
    -> std::expected<void, encode_to_error<Encoder>>;
Append to a growable sequence-like container¶
template <typename Encoder, typename Container>
constexpr auto encode_append_to(Container& container, Encoder& encoder) const
    -> /* void or std::expected<void, encode_error> */;

template <typename Encoder, typename Container>
    requires encoder_traits<Encoder>::allow_implicit_construction_requested
constexpr auto encode_append_to(Container& container) const
    -> /* void or std::expected<void, encode_error> */;
Behavior¶
- from_encoded(...) always materializes an owned validated UTF string
- to_encoded(...) builds a growable encoded string result
- encode_to(...) targets bounded raw sinks such as iterator/sentinel-backed outputs and reports overflow through encode_to_error<Encoder>
- encode_append_to(...) appends after the destination container's existing contents and never reports overflow
- encode_append_to(...) only participates for sequence-like append containers whose value_type can be constructed from the encoder's code_unit_type
- partial output written before overflow or codec failure is preserved
- if a growable destination container throws while appending, that exception propagates normally
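The difference between overflow and a codec failure, and the rule that partial output is preserved, can be sketched with a stand-alone bounded encoder. Everything here is hypothetical (sink_status, bounded_result, encode_ascii_bounded); the real encode_to(...) returns std::expected<void, encode_to_error<Encoder>> and does not expose a written count in this form.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Mirrors the two encode_to_error_kind values, plus an explicit success state.
enum class sink_status { ok, overflow, encoding_error };

struct bounded_result {
    sink_status status;
    std::size_t written;  // units stored before success or failure
};

// Strict ASCII encoding into a fixed-capacity buffer.
inline bounded_result encode_ascii_bounded(std::u32string_view scalars,
                                           char* out,
                                           std::size_t capacity) {
    std::size_t written = 0;
    for (const char32_t scalar : scalars) {
        if (scalar > 0x7F) {
            return { sink_status::encoding_error, written };  // codec-owned failure
        }
        if (written == capacity) {
            return { sink_status::overflow, written };  // sink-owned failure
        }
        out[written++] = static_cast<char>(scalar);
    }
    return { sink_status::ok, written };
}
```

In both failure cases the units already stored in the buffer stay in place, so a caller can inspect or discard the partial output.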
Built-in Codecs¶
The built-in single-byte codecs follow documented source mappings rather than ad hoc byte tables. The current built-ins use either:
- direct identity mapping over U+0000..U+00FF
- or a published WHATWG index file
encodings::ascii_strict¶
- code_unit_type = char8_t
- defines both encode_error and decode_error
- encodes and decodes only ASCII
- reports non-ASCII scalars or bytes as ordinary codec errors
- enables implicit construction
encodings::ascii_lossy¶
- code_unit_type = char8_t
- does not define encode_error or decode_error
- replaces unrepresentable scalars and invalid bytes with replacement output
- tracks replacement counts on the codec object
- does not opt into implicit construction, because callers typically care about the mutated codec object afterwards
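A lossy encoder of this kind might look roughly like the sketch below. It is not the library's ascii_lossy: byte_writer, lossy_ascii_sketch, and encode_lossy are hypothetical, and the '?' replacement is taken from the example at the top of this page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Stand-in writer handle over a std::string sink.
struct byte_writer {
    std::string* sink;
    void push(char unit) const { sink->push_back(unit); }
};

// Infallible encoder: never fails, but records how often it had to substitute.
struct lossy_ascii_sketch {
    std::size_t replacement_count = 0;

    void encode_one(char32_t scalar, byte_writer out) {
        if (scalar <= 0x7F) {
            out.push(static_cast<char>(scalar));
        } else {
            out.push('?');
            ++replacement_count;  // state lives on the codec object
        }
    }
};

inline std::string encode_lossy(std::u32string_view scalars, lossy_ascii_sketch& encoder) {
    std::string bytes;
    const byte_writer out{ &bytes };
    for (const char32_t scalar : scalars) {
        encoder.encode_one(scalar, out);
    }
    return bytes;
}
```

Because the count accumulates on the object, a default-constructed temporary would discard it, which fits the decision not to opt into implicit construction.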
encodings::iso_8859_1¶
- code_unit_type = char8_t
- defines encode_error, but decoding is infallible
- maps bytes 0x00..0xFF directly to Unicode U+0000..U+00FF
- encodes only scalars in the Latin-1 range and reports other scalars as ordinary encode errors
- enables implicit construction
Source mapping:
- direct Latin-1 identity mapping
encodings::iso_8859_15¶
- code_unit_type = char8_t
- defines encode_error, but decoding is infallible
- follows the WHATWG ISO-8859-15 index
- keeps the Latin-1 shape, but remaps 0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, and 0xBE
- encodes only scalars in the ISO-8859-15 repertoire and reports other scalars as ordinary encode errors
- enables implicit construction
Source table:
- WHATWG index: https://encoding.spec.whatwg.org/index-iso-8859-15.txt
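The eight remapped slots can be written out as a small decode function. The function name iso_8859_15_to_scalar is hypothetical, and the library presumably uses a table rather than a switch; the code point values are those of ISO-8859-15.

```cpp
#include <cassert>

// Decodes one ISO-8859-15 byte to its Unicode scalar value.
constexpr char32_t iso_8859_15_to_scalar(unsigned char byte) {
    switch (byte) {
        case 0xA4: return 0x20AC;  // EURO SIGN
        case 0xA6: return 0x0160;  // LATIN CAPITAL LETTER S WITH CARON
        case 0xA8: return 0x0161;  // LATIN SMALL LETTER S WITH CARON
        case 0xB4: return 0x017D;  // LATIN CAPITAL LETTER Z WITH CARON
        case 0xB8: return 0x017E;  // LATIN SMALL LETTER Z WITH CARON
        case 0xBC: return 0x0152;  // LATIN CAPITAL LIGATURE OE
        case 0xBD: return 0x0153;  // LATIN SMALL LIGATURE OE
        case 0xBE: return 0x0178;  // LATIN CAPITAL LETTER Y WITH DIAERESIS
        default:   return static_cast<char32_t>(byte);  // Latin-1 identity elsewhere
    }
}
```

Every byte has a mapping, which is why decoding is infallible while encoding still needs an error path for scalars outside the repertoire.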
encodings::windows_1251¶
- code_unit_type = char8_t
- defines encode_error, but decoding is infallible
- follows the WHATWG Windows-1251 index
- covers the Windows Cyrillic repertoire, including the WHATWG-preserved control and punctuation slots in the 0x80..0x9F range
- encodes ASCII and the Windows-1251 repertoire, and reports other scalars as ordinary encode errors
- enables implicit construction
Source table:
- WHATWG index: https://encoding.spec.whatwg.org/index-windows-1251.txt
encodings::windows_1252¶
- code_unit_type = char8_t
- defines encode_error, but decoding is infallible
- follows the WHATWG Windows-1252 index, not the older undefined-hole vendor mapping
- encodes ASCII and the Windows-1252 repertoire, and reports other scalars as ordinary encode errors
- decodes bytes 0x81, 0x8D, 0x8F, 0x90, and 0x9D to the corresponding C1 control code points, matching WHATWG
- enables implicit construction
Source table:
- WHATWG index: https://encoding.spec.whatwg.org/index-windows-1252.txt
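For reference, the 0x80..0x9F block of the WHATWG Windows-1252 index can be sketched as a lookup table. The names windows_1252_high_block and windows_1252_to_scalar are hypothetical and the library's internal layout may differ; bytes outside this block map to the same-valued code point.

```cpp
#include <cassert>

// Scalars for bytes 0x80..0x9F, following the WHATWG Windows-1252 index.
// The five "hole" bytes keep their C1 control values.
constexpr char32_t windows_1252_high_block[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,  // 0x80..0x87
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,  // 0x88..0x8F
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,  // 0x90..0x97
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,  // 0x98..0x9F
};

// Decodes one Windows-1252 byte to its Unicode scalar value.
constexpr char32_t windows_1252_to_scalar(unsigned char byte) {
    return (byte >= 0x80 && byte <= 0x9F)
        ? windows_1252_high_block[byte - 0x80]
        : static_cast<char32_t>(byte);
}
```

This is the mapping that turns the 0x80 byte in the opening example into U+20AC.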