Casing and Normalization
ASCII versus Unicode casing
The library exposes both ASCII-only and Unicode-aware casing APIs.
ASCII-only:
- to_ascii_lowercase()
- to_ascii_uppercase()
Unicode-aware:
- to_lowercase()
- to_uppercase()
The Unicode-aware APIs are locale-independent. They follow generated Unicode tables and do not apply locale-specific tailoring.
#include "unicode_ranges_all.hpp"
#include <print>
using namespace unicode_ranges;
using namespace unicode_ranges::literals;
int main()
{
    constexpr auto text = "straße café"_utf8_sv;
    std::println("{}", text.to_ascii_uppercase()); // STRAßE CAFé
    std::println("{}", text.to_uppercase()); // STRASSE CAFÉ
    std::println("{}", "CAFÉ Ω"_utf8_sv.to_lowercase()); // café ω
}
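To make the ASCII-only contract concrete, the following is a minimal standard-library sketch of an ASCII-only uppercase pass over UTF-8 bytes. The helper name ascii_upper_sketch is hypothetical and is not part of the library; the sketch only assumes that ASCII-only casing leaves every non-ASCII byte untouched, which is consistent with the STRAßE CAFé output above.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not a library API): uppercases only the bytes 'a'..'z'.
// Every byte of a multi-byte UTF-8 sequence has its high bit set, so it can
// never fall in that range and passes through unchanged.
std::string ascii_upper_sketch(std::string_view utf8)
{
    std::string out(utf8);
    for (char& c : out)
    {
        if (c >= 'a' && c <= 'z')
        {
            c = static_cast<char>(c - 'a' + 'A');
        }
    }
    return out;
}
```

Because the pass never inspects more than one byte at a time, it cannot turn ß into SS or é into É; those mappings need the Unicode-aware APIs.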
If the library is built with UTF8_RANGES_ENABLE_ICU=1, additional ICU-backed locale overloads are available for lowercasing, uppercasing, titlecasing, and case folding:
#include "unicode_ranges_all.hpp"
#include <print>
using namespace unicode_ranges;
using namespace unicode_ranges::literals;
int main()
{
#if UTF8_RANGES_HAS_ICU
    std::println("{}", u8"I\u0130"_utf8_sv.case_fold()); // ii̇
    std::println("{}", u8"I\u0130"_utf8_sv.case_fold("tr"_locale)); // ıi
    std::println("{}", u8"I\u0130"_utf8_sv.to_lowercase("tr"_locale)); // ıi
    std::println("{}", u8"i\u0131"_utf8_sv.to_uppercase("tr"_locale)); // İI
    std::println("{}", u8"istanbul izmir"_utf8_sv.to_titlecase("tr"_locale)); // İstanbul İzmir
    std::println("{}", U"istanbul izmir"_utf32_sv.to_titlecase("tr"_locale)); // İstanbul İzmir
    std::println("{}", u8"I"_utf8_sv.eq_ignore_case(u8"\u0131"_utf8_sv, "tr"_locale)); // true
    std::println("{}", is_available_locale("tr"_locale)); // true
#else
    std::println("Enable ICU-backed locale casing to use _locale.");
#endif
}
You can also check whether the current ICU data set explicitly exposes a locale identifier, as the final call in the example above does. is_available_locale(...) is a non-throwing probe. It returns false for invalid or unusable locale identifiers as well as for locale identifiers that are simply not exposed by the current ICU data set.
Behavior note:
- locale_id is a raw null-terminated locale-name token.
- _locale rejects embedded NULs in string literals at compile time.
- Raw locale_id{ ... } values do not own storage; the pointed-to locale name must stay alive for the duration of the call.
- The locale-aware overloads reject obviously unusable tokens such as locale_id{ nullptr }.
- Otherwise, locale-aware to_lowercase(...), to_uppercase(...), to_titlecase(...), and case_fold(...) forward the locale name to ICU.
- If the locale is not explicitly available in the current ICU data set, ICU may canonicalize the locale or fall back to a more general locale instead of failing the call.
- If ICU rejects the locale name or another ICU operation fails, the locale-aware overload throws std::runtime_error.
- If you need an exact availability check before calling a locale-aware casing overload, use is_available_locale(...).
to_titlecase(locale) is whole-string only. That is intentional: titlecasing depends on ICU break-iterator context, so partial (pos, count) overloads would have less predictable semantics than lowercasing or uppercasing a checked slice.
None of the locale-aware overloads exist in the dependency-free default build.
Partial casing on owning strings
Owning strings support both whole-string and subrange casing. The checked subrange overloads validate bounds and character boundaries. Whole-string overloads remain the cheaper path.
#include "unicode_ranges_all.hpp"
#include <print>
using namespace unicode_ranges;
using namespace unicode_ranges::literals;
int main()
{
    auto title = "café noir"_utf8_s;
    std::println("{}", title.to_uppercase()); // CAFÉ NOIR
    std::println("{}", title.to_uppercase(6, utf8_string::npos)); // café NOIR
}
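The character-boundary part of that validation can be sketched with the standard library alone. This is an illustrative sketch, not the library's implementation: is_char_boundary_sketch is a hypothetical name, the input is assumed to be valid UTF-8 already, and positions are assumed to be byte offsets, which is consistent with offset 6 selecting "noir" in the example above.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true when pos is in range and does not point into the
// middle of a multi-byte UTF-8 sequence. Assumes s is already valid UTF-8.
bool is_char_boundary_sketch(std::string_view s, std::size_t pos)
{
    if (pos > s.size())
    {
        return false; // out of bounds
    }
    if (pos == s.size())
    {
        return true; // one-past-the-end is a valid boundary
    }
    // Continuation bytes have the bit pattern 10xxxxxx.
    return (static_cast<unsigned char>(s[pos]) & 0xC0u) != 0x80u;
}
```

In "café noir", offset 3 (the lead byte of é) and offset 6 (the n) are boundaries, while offset 4 falls inside é and would be rejected by a check of this kind.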
Case folding
Case folding is the Unicode form intended for caseless matching rather than display transformation.
If ICU is enabled, case_fold(locale) is also available. In practice, the locale-sensitive difference is the Turkic special-I fold; most locales produce the same result as the default case_fold().
#include "unicode_ranges_all.hpp"
#include <print>
using namespace unicode_ranges;
using namespace unicode_ranges::literals;
int main()
{
    std::println("{}", "Straße"_utf8_sv.case_fold()); // strasse
}
That is different from simply lowercasing text. Lowercasing leaves German ß unchanged, whereas case folding maps it to ss, so that Straße and STRASSE compare equal in a case-insensitive comparison.
The string-view types also expose allocation-free helpers built on top of default Unicode case folding:
- eq_ignore_case(...)
- starts_with_ignore_case(...)
- ends_with_ignore_case(...)
- compare_ignore_case(...)
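One way to picture an allocation-free caseless comparison is to fold both inputs lazily and compare the folded streams element by element. The sketch below is an assumption about the general technique, not the library's code: fold_cursor_sketch and eq_ignore_case_sketch are hypothetical names, and the fold covers only ASCII letters and ß, whereas real case folding uses the full Unicode tables.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical cursor that yields folded bytes one at a time. It folds only
// ASCII letters and U+00DF (ß -> "ss"); everything else passes through.
class fold_cursor_sketch
{
public:
    explicit fold_cursor_sketch(std::string_view s) : s_(s) {}

    // Returns the next folded byte, or -1 at end of input.
    int next()
    {
        if (pending_s_)
        {
            pending_s_ = false;
            return 's'; // second half of the ß -> "ss" expansion
        }
        if (i_ >= s_.size())
        {
            return -1;
        }
        const auto b = static_cast<unsigned char>(s_[i_]);
        if (b == 0xC3 && i_ + 1 < s_.size()
            && static_cast<unsigned char>(s_[i_ + 1]) == 0x9F)
        {
            i_ += 2; // consume both bytes of ß
            pending_s_ = true;
            return 's';
        }
        ++i_;
        return (b >= 'A' && b <= 'Z') ? b - 'A' + 'a' : b;
    }

private:
    std::string_view s_;
    std::size_t i_ = 0;
    bool pending_s_ = false;
};

// Compares two UTF-8 strings under the sketch fold without building a
// folded copy of either input.
bool eq_ignore_case_sketch(std::string_view a, std::string_view b)
{
    fold_cursor_sketch ca(a);
    fold_cursor_sketch cb(b);
    for (;;)
    {
        const int x = ca.next();
        const int y = cb.next();
        if (x != y)
        {
            return false;
        }
        if (x == -1)
        {
            return true; // both streams ended together
        }
    }
}
```

The cursor carries a one-element pending state because a fold can expand one scalar into several, so the two inputs may advance at different rates while the folded streams stay aligned.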
Those helpers do not normalize. That is intentional:
- case folding and normalization stay separate operations in this library
- caseless comparison should not silently add normalization work or broaden equivalence
- callers who need canonical-equivalence-aware caseless matching should say so explicitly
So the default rule is: case-fold only. If you want canonical-equivalence-aware caseless matching, normalize explicitly first and then compare.
If ICU is enabled, locale-aware overloads of those helpers are also available:
- eq_ignore_case(..., locale)
- starts_with_ignore_case(..., locale)
- ends_with_ignore_case(..., locale)
- compare_ignore_case(..., locale)
They still compare folded scalar sequences without materializing a temporary folded string, but they are not noexcept because locale handling follows the same ICU-backed rules as case_fold(locale).
Normalization
Supported normalization forms:
- NFC
- NFD
- NFKC
- NFKD
#include "unicode_ranges_all.hpp"
#include <print>
using namespace unicode_ranges;
using namespace unicode_ranges::literals;
int main()
{
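    constexpr auto composed = u8"\u00E9"_utf8_sv; // é as one precomposed scalar
    constexpr auto decomposed = u8"e\u0301"_utf8_sv; // e followed by U+0301 combining acute
    std::println("{}", composed.to_nfd()); // é (e + U+0301)
    std::println("{}", decomposed.to_nfc()); // é (U+00E9)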
    std::println("{}", u8"\uFF21"_utf8_sv.to_nfkc()); // A (fullwidth U+FF21 becomes ASCII A)
}
And corresponding checks:
#include "unicode_ranges_all.hpp"
#include <print>
using namespace unicode_ranges;
using namespace unicode_ranges::literals;
int main()
{
    constexpr auto composed = u8"\u00E9"_utf8_sv; // é as one precomposed scalar
    constexpr auto decomposed = u8"e\u0301"_utf8_sv; // e followed by U+0301 combining acute
    std::println("{}", composed.is_nfc()); // true
    std::println("{}", decomposed.is_nfc()); // false
    std::println("{}", decomposed.is_nfd()); // true
}
Why normalization matters
Some text has multiple valid Unicode representations. For example, an accented character may be stored as:
- a single precomposed scalar
- or a base character plus combining mark
Normalization lets callers put text into a stable canonical or compatibility form before storage, comparison, or further processing.
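The two representations can be seen directly at the byte level. The following standard-library sketch does not use the library at all; the variable names are illustrative only.

```cpp
#include <string_view>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE, encoded as UTF-8.
constexpr std::string_view composed_bytes = "\xC3\xA9";
// U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT, encoded as UTF-8.
constexpr std::string_view decomposed_bytes = "e\xCC\x81";

// Byte-wise comparison sees two different strings even though both render as é.
constexpr bool same_bytes = (composed_bytes == decomposed_bytes);
```

Here same_bytes is false and the two views have different lengths (2 and 3 bytes), which is why un-normalized text can fail equality checks, hash lookups, or deduplication even when it looks identical on screen.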
API design choice: no partial normalization
The library intentionally does not expose normalize(pos, count) style APIs.
Normalization is less local than casing because boundary behavior can depend on combining marks and composition opportunities around the edge of a slice. Whole-string normalization is the safer and clearer initial contract.
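A small standard-library sketch shows how a slice edge can be a perfectly valid character boundary and still separate a base letter from its combining mark. The helper name starts_with_combining_mark_sketch is hypothetical, and the check is deliberately narrow: it recognizes only two-byte scalars in the Combining Diacritical Marks block (U+0300..U+036F), whereas real normalization consults the full canonical combining class data.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true when the two-byte UTF-8 sequence at pos decodes
// to a scalar in the Combining Diacritical Marks block (U+0300..U+036F).
bool starts_with_combining_mark_sketch(std::string_view s, std::size_t pos)
{
    if (pos >= s.size() || s.size() - pos < 2)
    {
        return false; // not enough bytes for a two-byte sequence
    }
    const auto b0 = static_cast<unsigned char>(s[pos]);
    const auto b1 = static_cast<unsigned char>(s[pos + 1]);
    if ((b0 & 0xE0u) != 0xC0u || (b1 & 0xC0u) != 0x80u)
    {
        return false; // not a two-byte lead followed by a continuation byte
    }
    const unsigned scalar = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
    return scalar >= 0x0300u && scalar <= 0x036Fu;
}
```

For the decomposed text "cafe" followed by U+0301, offset 4 is a character boundary, yet the slice starting there begins with a combining mark. Normalizing only the first four bytes to NFC would leave the e uncomposed, while normalizing the whole string would produce é.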
Current scope
Implemented:
- Unicode lowercasing and uppercasing
- full Unicode case folding
- NFC, NFD, NFKC, NFKD
Out of scope:
- built-in locale-specific casing tables without ICU
- locale-specific collation
- Turkic-specific case-fold tailoring switches