Boundary Encodings¶
unicode_ranges stays UTF-centric internally, but it can now encode to and decode from external non-UTF formats at the boundary.
This is the layer to use when text must cross an interface that speaks something other than validated UTF-8 / UTF-16 / UTF-32:
- legacy single-byte encodings
- protocol-specific byte formats
- application-defined encodings
- bounded output buffers
- growable byte containers
The core rule stays the same:
- validated UTF types are still the semantic center of the library
- arbitrary encodings live at ingress and egress only
For the exact public API surface, see Boundary Encodings.
What This Layer Gives You¶
- built-in codecs for common boundary formats
- a way to define your own Encoder and Decoder types without specializing library traits
- generated from_encoded(...), to_encoded(...), encode_to(...), and encode_append_to(...) entry points on the owning UTF string types
- a uniform error model for codec failures versus bounded-output overflow
- preserved UTF invariants on the library side
Built-in Codecs¶
The library currently ships these boundary codecs:
| Codec | Direction | Behavior |
|---|---|---|
| encodings::ascii_strict | encode + decode | strict ASCII only; non-ASCII input is an ordinary codec error |
| encodings::ascii_lossy | encode + decode | replaces unrepresentable scalars and invalid bytes; tracks replacement count on the codec object |
| encodings::iso_8859_1 | encode + decode | strict ISO-8859-1 / Latin-1 mapping; encode is fallible, decode is total |
| encodings::iso_8859_15 | encode + decode | strict ISO-8859-15 / Latin-9 mapping; encode is fallible, decode is total |
| encodings::windows_1251 | encode + decode | strict WHATWG Windows-1251 mapping; encode is fallible, decode is total |
| encodings::windows_1252 | encode + decode | strict WHATWG Windows-1252 mapping; encode is fallible, decode is total |
Use built-ins when they already match your boundary contract. Define custom codecs when you need different replacement behavior, diagnostics, protocol rules, or a different encoding altogether.
The built-in single-byte codecs use documented source mappings. iso_8859_1 is the direct Latin-1 identity mapping. iso_8859_15, windows_1251, and windows_1252 follow the corresponding WHATWG index files.
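To make "encode is fallible, decode is total" concrete, here is a minimal standalone sketch of a Latin-1 identity mapping. It does not use the library: latin1_decode and latin1_encode are hypothetical helper names, plain char stands in for the code unit type, and std::optional stands in for a fallible result.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Decode is total: every byte 0x00..0xFF maps to the scalar with the same value.
inline std::u32string latin1_decode(std::string_view bytes)
{
    std::u32string scalars;
    scalars.reserve(bytes.size());
    for (const char byte : bytes)
    {
        scalars.push_back(static_cast<char32_t>(static_cast<unsigned char>(byte)));
    }
    return scalars;
}

// Encode is fallible: scalars above U+00FF have no Latin-1 byte.
inline std::optional<std::string> latin1_encode(std::u32string_view scalars)
{
    std::string bytes;
    bytes.reserve(scalars.size());
    for (const char32_t scalar : scalars)
    {
        if (scalar > 0xFFu)
        {
            return std::nullopt; // unrepresentable scalar
        }
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(scalar)));
    }
    return bytes;
}
```

The table-driven codecs would presumably differ only in replacing the identity step with a lookup, which is also where decode stays total as long as the index assigns a scalar to every byte.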
Defining Your Own Codecs¶
You define real codec objects. The library detects their hooks through encoder_traits and decoder_traits, but users do not specialize those traits directly.
That means:
- codec state lives on the object itself
- stateless codecs can stay empty and cheap
- stateful codecs can keep counters, buffers, runtime configuration, or diagnostics on the object
Encoder Requirements¶
A custom encoder must define:
- using code_unit_type = ...;
- encode_one(char32_t scalar, Writer out)
Optional encoder additions are:
- using encode_error = ...;
- static constexpr bool allow_implicit_construction = true;
- flush(Writer out)
- encode_from_utf8(...)
- encode_from_utf16(...)
- encode_from_utf32(...)
Semantic requirements:
- code_unit_type must be usable with std::basic_string_view<code_unit_type>
- if the encoder participates in to_encoded(...), code_unit_type must also be usable with std::basic_string<code_unit_type, std::char_traits<code_unit_type>, ...>
- encode_one(...) receives one Unicode scalar and writes encoded code units through the writer handle
- whole-input hooks such as encode_from_utf8(...) must consume the entire input view on success
- the library still calls flush(...) after a successful whole-input hook
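The required encoder hooks can be sketched without the library. In the following, string_writer, escape_encoder, and encode_all are hypothetical names: the real writer type and the real per-scalar loop are library internals that this page does not show, and plain char is used as the code unit type for simplicity. The encoder omits the optional flush(...) hook.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical writer handle: a cheap, copyable view onto a std::string sink.
struct string_writer
{
    std::string* sink;

    void reserve(std::size_t additional_units) { sink->reserve(sink->size() + additional_units); }
    void push(char unit) { sink->push_back(unit); }
    void append(std::string_view units) { sink->append(units); }
};

// Infallible encoder: ASCII passes through, everything else becomes a
// "\u{HEX}" escape, so one scalar may produce several code units.
struct escape_encoder
{
    using code_unit_type = char;

    template <typename Writer>
    void encode_one(char32_t scalar, Writer out)
    {
        if (scalar <= 0x7Fu)
        {
            out.push(static_cast<char>(scalar));
            return;
        }
        static constexpr char digits[] = "0123456789ABCDEF";
        std::string hex;
        for (char32_t rest = scalar; rest != 0; rest >>= 4)
        {
            hex.insert(hex.begin(), digits[rest & 0xFu]);
        }
        out.append("\\u{");
        out.append(hex);
        out.push('}');
    }
};

// Stand-in for the primitive encode loop: one encode_one call per scalar.
template <typename Encoder>
std::string encode_all(Encoder& encoder, std::u32string_view scalars)
{
    std::string encoded;
    string_writer out{ &encoded };
    out.reserve(scalars.size());
    for (const char32_t scalar : scalars)
    {
        encoder.encode_one(scalar, out);
    }
    return encoded;
}
```

Because encode_one(...) only sees one scalar at a time, an encoder that needs lookahead or batching would be a candidate for the whole-input hooks instead.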
Decoder Requirements¶
A custom decoder must define:
- using code_unit_type = ...;
- decode_one(std::basic_string_view<code_unit_type> input, Writer out)
Optional decoder additions are:
- using decode_error = ...;
- static constexpr bool allow_implicit_construction = true;
- flush(Writer out)
- decode_to_utf8(...)
- decode_to_utf16(...)
- decode_to_utf32(...)
Semantic requirements:
- code_unit_type must be usable with std::basic_string_view<code_unit_type>
- decode_one(...) receives the remaining suffix of the original encoded input after all previously reported consumption has been removed
- the success return value is the number of code units consumed from that suffix
- successful decode_one(...) calls must consume at least one code unit and no more than input.size()
- once the input is exhausted, the library stops calling decode_one(...) and proceeds to flush(...)
- flush(...) must be callable even if no prior decode_one(...) call happened, which naturally occurs for empty input
- whole-input hooks such as decode_to_utf8(...) must consume the entire input view on success
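The suffix-and-consumption protocol is easiest to see in a driver loop. This is a standalone sketch under stated assumptions: scalar_writer, newline_folding_decoder, and decode_all are hypothetical names, plain char is the code unit type, and the loop only approximates what the library does internally. The decoder omits the optional flush(...) hook.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical writer handle for decoded scalars.
struct scalar_writer
{
    std::u32string* sink;
    void push(char32_t scalar) { sink->push_back(scalar); }
};

// Toy infallible decoder: "\r\n" collapses to one U+000A (consumes 2 units),
// any other byte maps to the scalar of the same value (consumes 1 unit).
struct newline_folding_decoder
{
    using code_unit_type = char;

    template <typename Writer>
    std::size_t decode_one(std::string_view input, Writer out)
    {
        if (input.size() >= 2 && input[0] == '\r' && input[1] == '\n')
        {
            out.push(U'\n');
            return 2;
        }
        out.push(static_cast<char32_t>(static_cast<unsigned char>(input.front())));
        return 1;
    }
};

// Stand-in for the primitive decode loop.
template <typename Decoder>
std::u32string decode_all(Decoder& decoder, std::string_view input)
{
    std::u32string decoded;
    scalar_writer out{ &decoded };
    while (!input.empty())
    {
        const std::size_t consumed = decoder.decode_one(input, out);
        assert(consumed >= 1 && consumed <= input.size()); // consumption contract
        input.remove_prefix(consumed); // the next call sees only the remaining suffix
    }
    return decoded;
}
```

Note that decode_one(...) is never called with an empty view here, which is why the toy decoder can use input.front() without a size check.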
Fallible Versus Infallible Codecs¶
Whether a codec is fallible is declared by the presence of the error alias:
- define encode_error to make the encoder fallible
- define decode_error to make the decoder fallible
Once the alias exists, the matching hooks must switch to std::expected return types:
- encode_one(...)
- decode_one(...)
- flush(...)
- any whole-input bulk hook you provide
Additional rules:
- encode_error and decode_error must be non-void types
- if encode_error exists, encode_one(...), flush(...), and any encode_from_utf* hook must return std::expected<..., encode_error>
- if decode_error exists, decode_one(...), flush(...), and any decode_to_utf* hook must return std::expected<..., decode_error>
- using encode_error = void; and using decode_error = void; are not valid; omit the alias entirely for infallible codecs
If the alias is absent, those hooks use the direct success form instead.
Example:
#include <expected>

// Infallible encoder: no encode_error alias, plain success returns.
struct ascii_encoder {
    using code_unit_type = char8_t;

    template <typename Writer>
    void encode_one(char32_t scalar, Writer out);

    template <typename Writer>
    void flush(Writer out);
};

// Fallible encoder: encode_error exists, matching hooks return expected.
struct strict_legacy_encoder {
    using code_unit_type = char8_t;

    enum class encode_error {
        unrepresentable_scalar
    };

    template <typename Writer>
    std::expected<void, encode_error> encode_one(char32_t scalar, Writer out);

    template <typename Writer>
    std::expected<void, encode_error> flush(Writer out);
};
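How the presence of the alias could be detected is not shown on this page; the library does it inside encoder_traits. The following is only one common way to express such a check, using a hypothetical is_fallible_encoder trait and two hypothetical encoders that declare nothing but the aliases.

```cpp
#include <cassert>
#include <type_traits>

// Primary template: no encode_error alias, so the encoder counts as infallible.
template <typename Encoder, typename = void>
struct is_fallible_encoder : std::false_type {};

// Selected only when Encoder::encode_error names a type.
template <typename Encoder>
struct is_fallible_encoder<Encoder, std::void_t<typename Encoder::encode_error>>
    : std::true_type
{
    // Mirrors the rule that a void alias is rejected rather than ignored.
    static_assert(!std::is_void_v<typename Encoder::encode_error>,
                  "omit encode_error entirely for infallible encoders");
};

struct plain_encoder
{
    using code_unit_type = char;
};

struct strict_encoder
{
    using code_unit_type = char;
    enum class encode_error { unrepresentable_scalar };
};
```

A decoder-side trait would look the same with decode_error in place of encode_error.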
Implicit Construction¶
The convenience overloads that do not take an explicit codec object are controlled by allow_implicit_construction.
Rules:
- if allow_implicit_construction is omitted, an empty default-constructible codec may still be treated as implicitly constructible
- explicit false opts out even for empty default-constructible codecs
- explicit true opts in even for non-empty codecs
- if allow_implicit_construction is true but the codec is not default-constructible, the convenience overloads fail with a static assertion because the library must default-construct a temporary codec object internally
This matters because the generated no-object APIs internally create a temporary codec object and do not return it to you afterwards.
That is appropriate for stateless codecs. It is usually the wrong choice for stateful codecs whose counters or diagnostics matter after the operation.
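The rules above can be read as a small compile-time decision. The sketch below is an interpretation, not the library's code: has_implicit_flag and implicitly_constructible_codec are hypothetical names, and treating a non-empty codec without the flag as not implicitly constructible is an assumption drawn from the wording of the first rule.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Detects whether the codec declares allow_implicit_construction at all.
template <typename Codec, typename = void>
struct has_implicit_flag : std::false_type {};

template <typename Codec>
struct has_implicit_flag<Codec, std::void_t<decltype(Codec::allow_implicit_construction)>>
    : std::true_type {};

template <typename Codec>
constexpr bool implicitly_constructible_codec()
{
    if constexpr (has_implicit_flag<Codec>::value)
    {
        // An explicit flag wins in both directions.
        if constexpr (Codec::allow_implicit_construction)
        {
            static_assert(std::is_default_constructible_v<Codec>,
                          "an opted-in codec must be default-constructible");
        }
        return Codec::allow_implicit_construction;
    }
    else
    {
        // Flag omitted: only empty, default-constructible codecs qualify.
        return std::is_empty_v<Codec> && std::is_default_constructible_v<Codec>;
    }
}

struct empty_codec {};

struct opted_out_codec
{
    static constexpr bool allow_implicit_construction = false;
};

struct counting_codec
{
    std::size_t replacement_count = 0;
};

struct opted_in_codec
{
    static constexpr bool allow_implicit_construction = true;
    std::size_t replacement_count = 0;
};
```

Under this reading, opted_in_codec qualifies even though its replacement_count would be lost with the temporary, which is exactly the situation the paragraph above warns about.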
The Writer Contract¶
Codec hooks do not write directly to arbitrary iterators or containers. They receive a small writer handle instead.
Important properties:
- writers are passed by value
- copying a writer copies only the handle, not the destination
- writer copies still talk to the same underlying sink
- writers are call-scoped and should not be retained by codec objects
Writers provide three operations:
- reserve(additional_units)
- push(unit)
- append(units)
The same writer model is used for:
- encoded code units on the encode side
- Unicode scalar output on the primitive decode side
- UTF code-unit output inside the library's bulk decode paths
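A writer with these properties can be as small as one pointer. The following standalone sketch assumes a hypothetical vector_writer over std::vector and a hypothetical write_greeting hook; the library's real writer types are not shown on this page and may look different.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical growable-sink writer: the handle stores only a pointer,
// so copies are cheap and all copies write to the same vector.
template <typename Unit>
class vector_writer
{
public:
    explicit vector_writer(std::vector<Unit>& sink) : sink_(&sink) {}

    void reserve(std::size_t additional_units) { sink_->reserve(sink_->size() + additional_units); }
    void push(Unit unit) { sink_->push_back(unit); }

    template <typename Units>
    void append(const Units& units) { sink_->insert(sink_->end(), units.begin(), units.end()); }

private:
    std::vector<Unit>* sink_;
};

// A hook receives the writer by value, as codec hooks do.
template <typename Writer>
void write_greeting(Writer out)
{
    const std::string_view greeting = "hi";
    out.reserve(greeting.size() + 1);
    out.append(greeting);
    out.push('!');
}
```

Passing the handle by value keeps hook signatures simple, and because the handle points at caller-owned storage it should not outlive the call that received it.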
Generated Owning-String APIs¶
After you define a decoder, the owning UTF string types gain these entry points:
- utf8_string::from_encoded<Decoder>(...)
- utf16_string::from_encoded<Decoder>(...)
- utf32_string::from_encoded<Decoder>(...)
After you define an encoder, the owning UTF string types gain these entry points:
- text.to_encoded<Encoder>(...)
- text.encode_to<Encoder>(...)
- text.encode_append_to<Encoder>(...)
In practice that means:
- decode always materializes an owned validated UTF string
- to_encoded(...) builds an owned encoded string
- encode_to(...) targets a bounded raw output range and reports overflow explicitly
- encode_append_to(...) appends to an existing growable sequence-like container and never reports overflow
This example shows a custom strict ASCII encoder/decoder pair and the generated methods it unlocks:
#include <array>
#include <cstddef>
#include <cstdint>
#include <expected>
#include <span>
#include <string_view>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;

struct strict_ascii_encoder
{
    using code_unit_type = char8_t;

    enum class encode_error
    {
        unrepresentable_scalar
    };

    static constexpr bool allow_implicit_construction = true;

    template <typename Writer>
    constexpr auto encode_one(char32_t scalar, Writer out) -> std::expected<void, encode_error>
    {
        if (scalar > 0x7Fu)
        {
            return std::unexpected(encode_error::unrepresentable_scalar);
        }

        out.push(static_cast<char8_t>(scalar));
        return {};
    }
};

struct strict_ascii_decoder
{
    using code_unit_type = char8_t;

    enum class decode_error
    {
        invalid_input
    };

    static constexpr bool allow_implicit_construction = true;

    template <typename Writer>
    constexpr auto decode_one(std::basic_string_view<char8_t> input, Writer out)
        -> std::expected<std::size_t, decode_error>
    {
        // input is never empty here: the library stops calling once it is exhausted.
        const auto byte = static_cast<std::uint8_t>(input.front());
        if (byte > 0x7Fu)
        {
            return std::unexpected(decode_error::invalid_input);
        }

        out.push(static_cast<char32_t>(byte));
        return 1; // one code unit consumed
    }
};

int main()
{
    // Decode: fallible decoder, so the result is a std::expected.
    auto decoded = utf8_string::from_encoded<strict_ascii_decoder>(u8"Hello");
    if (!decoded || decoded->base() != u8"Hello")
    {
        return 1;
    }

    // Encode into an owned string.
    auto owned_bytes = decoded->to_encoded<strict_ascii_encoder>();
    if (!owned_bytes || *owned_bytes != u8"Hello")
    {
        return 1;
    }

    // Encode into a bounded buffer that is exactly large enough.
    std::array<char8_t, 5> bounded{};
    auto bounded_result = decoded->encode_to<strict_ascii_encoder>(std::span<char8_t>{ bounded });
    if (!bounded_result || std::u8string_view{ bounded.data(), bounded.size() } != u8"Hello")
    {
        return 1;
    }

    // Append to a growable container that already holds data.
    std::vector<char8_t> appended{ static_cast<char8_t>('>') };
    auto append_result = decoded->encode_append_to<strict_ascii_encoder>(appended);
    if (!append_result || appended != std::vector<char8_t>{
        static_cast<char8_t>('>'),
        static_cast<char8_t>('H'),
        static_cast<char8_t>('e'),
        static_cast<char8_t>('l'),
        static_cast<char8_t>('l'),
        static_cast<char8_t>('o') })
    {
        return 1;
    }

    return 0;
}
Error Handling And Guarantees¶
Boundary encoding has two separate failure classes:
- codec-defined failures
- bounded-output overflow on the raw encode_to(...) path
Codec Failures¶
Codec failures belong to the codec object itself:
- strict rejection of unrepresentable scalars
- invalid input sequences
- protocol-specific decode failures
These surface through encode_error or decode_error.
Return-type rules:
- from_encoded(...) returns the UTF value directly for infallible decoders, or std::expected<UTF, decode_error> for fallible decoders
- to_encoded(...) returns the encoded string directly for infallible encoders, or std::expected<string, encode_error> for fallible encoders
- encode_append_to(...) returns void for infallible encoders, or std::expected<void, encode_error> for fallible encoders
Bounded Output Overflow¶
Overflow is not a codec error.
It belongs to the library-owned bounded writer used by encode_to(...).
That path always reports:
std::expected<void, encode_to_error<Encoder>>
with:
- encode_to_error_kind::overflow for destination exhaustion
- encode_to_error_kind::encoding_error for codec-defined encode failures
Partial Output Preservation¶
When output has already been written, it is preserved:
- on bounded overflow
- on codec failure after writing a prefix
- on growable-container writes before a container exception escapes
The library does not roll already-written output back.
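The split between the two failure kinds, and the fact that a written prefix survives either one, can be modeled in a few lines. Everything in this sketch is hypothetical (encode_to_failure, bounded_state, bounded_writer, encode_ascii_to); it mimics the behavior described here with std::optional rather than reproducing the library's encode_to_error<Encoder> type.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical mirror of the two failure kinds on the bounded path.
enum class encode_to_failure { overflow, encoding_error };

// State lives outside the handle so that every copy of the writer
// observes the same write position.
struct bounded_state
{
    char* data;
    std::size_t capacity;
    std::size_t written = 0;
    bool overflowed = false;
};

struct bounded_writer
{
    bounded_state* state;

    void push(char unit)
    {
        if (state->written == state->capacity)
        {
            state->overflowed = true; // destination exhausted; nothing is rolled back
            return;
        }
        state->data[state->written++] = unit;
    }
};

// Strict ASCII encode into caller-owned storage; std::nullopt means success.
inline std::optional<encode_to_failure> encode_ascii_to(std::u32string_view scalars,
                                                        bounded_state& state)
{
    bounded_writer out{ &state };
    for (const char32_t scalar : scalars)
    {
        if (scalar > 0x7Fu)
        {
            return encode_to_failure::encoding_error; // codec-defined failure
        }
        out.push(static_cast<char>(scalar));
        if (state.overflowed)
        {
            return encode_to_failure::overflow; // writer-owned failure
        }
    }
    return std::nullopt;
}
```

In both failure cases the caller can still read state.written units from the buffer, which is the practical meaning of "not rolled back".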
Decoder Validation Guarantee¶
Custom decoders are allowed to write through bulk UTF paths such as decode_to_utf8(...), but the library still validates the final UTF result before constructing the public owning UTF type.
That keeps the invariant intact:
- successful from_encoded(...) still returns validated UTF
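For orientation, this is roughly the kind of check a final UTF-8 validation pass has to perform on bulk hook output. The function name is_well_formed_utf8 is hypothetical and the code is a plain reference-style sketch, not the library's validator, which is not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns true when bytes is well-formed UTF-8: no stray or missing
// continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF.
inline bool is_well_formed_utf8(std::string_view bytes)
{
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const auto lead = static_cast<unsigned char>(bytes[i]);
        std::size_t length = 0;
        char32_t scalar = 0;
        char32_t minimum = 0; // smallest scalar allowed for this sequence length

        if (lead <= 0x7F) { length = 1; scalar = lead; minimum = 0; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; scalar = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; scalar = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; scalar = lead & 0x07; minimum = 0x10000; }
        else { return false; } // stray continuation byte or invalid lead byte

        if (bytes.size() - i < length)
        {
            return false; // truncated sequence
        }
        for (std::size_t k = 1; k < length; ++k)
        {
            const auto cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
            {
                return false; // not a continuation byte
            }
            scalar = (scalar << 6) | (cont & 0x3F);
        }
        if (scalar < minimum || scalar > 0x10FFFF || (scalar >= 0xD800 && scalar <= 0xDFFF))
        {
            return false; // overlong, out of range, or surrogate
        }
        i += length;
    }
    return true;
}
```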
Contract Violations¶
The codec hooks are a contract.
Examples of contract violations:
- reporting success from decode_one(...) with 0 consumed input
- emitting an invalid Unicode scalar
- writing malformed UTF through a bulk UTF hook
When UTF8_RANGES_ENABLE_CODEC_CONTRACT_CHECKS is enabled:
- exception-enabled builds throw codec_contract_violation
- no-exception builds terminate
When contract checks are disabled, violating the codec contract is undefined behavior.
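As an illustration of what a contract check might look like on the scalar-writing side, the sketch below wraps a sink in a checking writer. All names are hypothetical (sketch_contract_violation, is_unicode_scalar, checked_scalar_writer, broken_decode_one); the library's own exception type is codec_contract_violation, and how it wires the checks in is not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the library's contract exception.
struct sketch_contract_violation : std::logic_error
{
    using std::logic_error::logic_error;
};

// A Unicode scalar value is any code point except the surrogate range.
constexpr bool is_unicode_scalar(char32_t value)
{
    return value <= 0x10FFFFu && !(value >= 0xD800u && value <= 0xDFFFu);
}

// Hypothetical checking writer: refuses anything that is not a scalar value.
struct checked_scalar_writer
{
    std::u32string* sink;

    void push(char32_t scalar)
    {
        if (!is_unicode_scalar(scalar))
        {
            throw sketch_contract_violation("decoder emitted an invalid Unicode scalar");
        }
        sink->push_back(scalar);
    }
};

// A deliberately broken decoder hook that emits a lone surrogate.
template <typename Writer>
std::size_t broken_decode_one(Writer out)
{
    out.push(static_cast<char32_t>(0xD800));
    return 1;
}
```

With the check compiled out, the surrogate would reach the sink unnoticed, which is why the unchecked configuration leaves the behavior undefined.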
Native Codecs In Use¶
The built-ins are meant to demonstrate the supported codec shapes as well as provide immediate value.
This example uses all currently supported native codecs:
#include <array>
#include <string>
#include <string_view>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    // Windows-1252: byte 0x80 decodes to U+20AC (euro sign).
    const std::array<char8_t, 8> windows_bytes{
        static_cast<char8_t>('P'),
        static_cast<char8_t>('r'),
        static_cast<char8_t>('i'),
        static_cast<char8_t>('c'),
        static_cast<char8_t>('e'),
        static_cast<char8_t>(':'),
        static_cast<char8_t>(' '),
        static_cast<char8_t>(0x80u)
    };
    const auto windows_text = utf8_string::from_encoded<encodings::windows_1252>(
        std::u8string_view{ windows_bytes.data(), windows_bytes.size() });
    if (windows_text.base() != u8"Price: \u20AC")
    {
        return 1;
    }

    const auto windows_round_trip = windows_text.to_encoded<encodings::windows_1252>();
    if (!windows_round_trip || *windows_round_trip != std::u8string{ windows_bytes.begin(), windows_bytes.end() })
    {
        return 1;
    }

    // ISO-8859-1: byte 0xE9 decodes to U+00E9.
    const std::array<char8_t, 2> latin1_bytes{
        static_cast<char8_t>('C'),
        static_cast<char8_t>(0xE9u)
    };
    const auto latin1_text = utf8_string::from_encoded<encodings::iso_8859_1>(
        std::u8string_view{ latin1_bytes.data(), latin1_bytes.size() });
    if (latin1_text.base() != u8"C\u00E9")
    {
        return 1;
    }

    const auto latin1_round_trip = latin1_text.to_encoded<encodings::iso_8859_1>();
    if (!latin1_round_trip || *latin1_round_trip != std::u8string{ latin1_bytes.begin(), latin1_bytes.end() })
    {
        return 1;
    }

    // ISO-8859-15: the four bytes where Latin-9 differs from Latin-1 in this range.
    const std::array<char8_t, 4> latin9_bytes{
        static_cast<char8_t>(0xA4u),
        static_cast<char8_t>(0xBCu),
        static_cast<char8_t>(0xBDu),
        static_cast<char8_t>(0xBEu)
    };
    const auto latin9_text = utf8_string::from_encoded<encodings::iso_8859_15>(
        std::u8string_view{ latin9_bytes.data(), latin9_bytes.size() });
    if (latin9_text.base() != u8"\u20AC\u0152\u0153\u0178")
    {
        return 1;
    }

    const auto latin9_round_trip = latin9_text.to_encoded<encodings::iso_8859_15>();
    if (!latin9_round_trip || *latin9_round_trip != std::u8string{ latin9_bytes.begin(), latin9_bytes.end() })
    {
        return 1;
    }

    // Windows-1251: six Cyrillic letters.
    const std::array<char8_t, 6> windows_1251_bytes{
        static_cast<char8_t>(0xCFu),
        static_cast<char8_t>(0xF0u),
        static_cast<char8_t>(0xE8u),
        static_cast<char8_t>(0xE2u),
        static_cast<char8_t>(0xE5u),
        static_cast<char8_t>(0xF2u)
    };
    const auto windows_1251_text = utf8_string::from_encoded<encodings::windows_1251>(
        std::u8string_view{ windows_1251_bytes.data(), windows_1251_bytes.size() });
    if (windows_1251_text.base() != u8"\u041F\u0440\u0438\u0432\u0435\u0442")
    {
        return 1;
    }

    const auto windows_1251_round_trip = windows_1251_text.to_encoded<encodings::windows_1251>();
    if (!windows_1251_round_trip || *windows_1251_round_trip != std::u8string{ windows_1251_bytes.begin(), windows_1251_bytes.end() })
    {
        return 1;
    }

    // Strict ASCII: decode is fallible, so the result is a std::expected.
    const auto strict_ascii = utf8_string::from_encoded<encodings::ascii_strict>(u8"Hello");
    if (!strict_ascii || strict_ascii->base() != u8"Hello")
    {
        return 1;
    }

    // Lossy ASCII: pass the codec object explicitly so its counter can be read afterwards.
    encodings::ascii_lossy lossy{};
    std::vector<char8_t> lossy_bytes{};
    u8"Caf\u00E9"_utf8_sv.to_utf8_owned().encode_append_to(lossy_bytes, lossy);
    if (lossy.replacement_count != 1 || lossy_bytes != std::vector<char8_t>{
        static_cast<char8_t>('C'),
        static_cast<char8_t>('a'),
        static_cast<char8_t>('f'),
        static_cast<char8_t>('?') })
    {
        return 1;
    }

    return 0;
}
Stateful Codecs¶
A stateful codec is just a non-empty codec object whose state changes across calls.
Typical uses:
- replacement counters
- warnings and diagnostics
- buffered partial sequences
- runtime configuration
Because state lives on the object itself, stateful codecs usually should not opt into implicit construction. You normally want to inspect the same object after the operation finishes.
This example shows a small stateful lossy encoder:
#include <cstddef>
#include <vector>

#include "unicode_ranges_all.hpp"

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

struct counting_ascii_lossy_encoder
{
    using code_unit_type = char8_t;

    // Stateful: callers must pass an explicit object so the counter survives the call.
    static constexpr bool allow_implicit_construction = false;

    std::size_t replacement_count = 0;

    template <typename Writer>
    constexpr void encode_one(char32_t scalar, Writer out)
    {
        if (scalar <= 0x7Fu)
        {
            out.push(static_cast<char8_t>(scalar));
            return;
        }

        // Non-ASCII scalar: substitute '?' and record the replacement.
        out.push(static_cast<char8_t>('?'));
        ++replacement_count;
    }
};

int main()
{
    counting_ascii_lossy_encoder encoder{};
    std::vector<char8_t> bytes{};
    u8"Caf\u00E9"_utf8_sv.to_utf8_owned().encode_append_to(bytes, encoder);

    return encoder.replacement_count == 1
        && bytes == std::vector<char8_t>{
            static_cast<char8_t>('C'),
            static_cast<char8_t>('a'),
            static_cast<char8_t>('f'),
            static_cast<char8_t>('?') }
        ? 0
        : 1;
}
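State that has to be closed out at the end of the input is where flush(...) earns its place. The standalone sketch below uses a toy shift-state format invented for this illustration (it is not a real standard), with hypothetical names byte_writer, shift_encoder, and run_encoder, and plain char as the code unit type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical writer handle over a std::string sink.
struct byte_writer
{
    std::string* sink;
    void push(char unit) { sink->push_back(unit); }
};

// Toy shift-state encoder: U+0080..U+00FF are written as 7-bit bytes
// between SO (0x0E) and SI (0x0F); anything above U+00FF becomes '?'.
struct shift_encoder
{
    using code_unit_type = char;

    bool shifted = false;              // persists across encode_one calls
    std::size_t replacement_count = 0; // readable after the operation

    template <typename Writer>
    void encode_one(char32_t scalar, Writer out)
    {
        if (scalar > 0xFFu)
        {
            scalar = U'?';
            ++replacement_count;
        }
        const bool needs_shift = scalar >= 0x80u;
        if (needs_shift != shifted)
        {
            out.push(needs_shift ? '\x0E' : '\x0F');
            shifted = needs_shift;
        }
        out.push(static_cast<char>(scalar & 0x7Fu));
    }

    // Runs once after the last scalar and closes an open shift.
    template <typename Writer>
    void flush(Writer out)
    {
        if (shifted)
        {
            out.push('\x0F');
            shifted = false;
        }
    }
};

// Stand-in for the library's loop: encode every scalar, then flush.
template <typename Encoder>
std::string run_encoder(Encoder& encoder, std::u32string_view scalars)
{
    std::string encoded;
    byte_writer out{ &encoded };
    for (const char32_t scalar : scalars)
    {
        encoder.encode_one(scalar, out);
    }
    encoder.flush(out);
    return encoded;
}
```

Without the flush step, input ending in a shifted character would leave the output in the shifted state, so any bytes concatenated afterwards would be misread.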
Where To Go Next¶
- Boundary Encodings for the exact overload sets and type aliases
- Owning Strings for the UTF types these APIs attach to
- Common Tasks for higher-level Unicode operations after text is already validated