Getting Started¶
Requirements¶
unicode_ranges requires a compiler and standard library with strong C++23 support.
Minimum toolchains currently exercised in CI:
- MSVC with the MSVC STL: Visual Studio 2022 toolset v143 or newer
- Clang-cl with the MSVC STL: current Visual Studio 2022 ClangCL
- GCC with libstdc++: GCC 14 / libstdc++ 14 or newer
- Clang with libc++: Clang 22 / libc++ 22 or newer
The checked-in Unicode data currently tracks Unicode 17.0.0.
Install and integrate¶
If you have not wired the library into your build yet, start with Install And Integrate.
Short version:
- today, the normal consumption path is vendoring, a git submodule, or source-fetching in CMake
- build and link the unicode_ranges library target, or an equivalent library target in your own build
- there is not yet a first-party package-manager distribution
- your build needs C++23 and the repository root on the include path
- the repository already vendors pinned simdutf (v7.7.0) under third_party/simdutf
- runtime UTF validation and UTF-8 <-> UTF-16/UTF-32 transcoding currently go through the simdutf backend; compile-time and higher-level APIs remain in unicode_ranges
Include the library¶
Use unicode_ranges_all.hpp when you want the all-in-one umbrella header. It includes the owning string types utf8_string, utf16_string, and utf32_string, plus the named unicode_ranges::characters catalog.
Terminology cheat sheet¶
| Term | Meaning here |
|---|---|
| code unit | one 8-bit UTF-8 byte, one 16-bit UTF-16 unit, or one 32-bit UTF-32 unit |
| scalar | one Unicode scalar value |
| grapheme | one user-perceived character under the default Unicode grapheme rules |
| UTF-8 offset | byte offset |
| UTF-16 offset | code-unit offset |
| boundary API | an API such as is_char_boundary() or ceil_grapheme_boundary() that works in terms of valid semantic cut points |
Choose the right entry point¶
- Use _utf8_sv / _utf16_sv / _utf32_sv for validated compile-time views.
- Use utf8_string::from_bytes(...), utf16_string::from_code_units(...), or utf32_string::from_code_points(...) when input arrives as raw runtime data.
- Use _utf8_s / _utf16_s / _utf32_s when you want an owning validated string immediately.
A first validated view¶
This is the style the docs will use going forward: visible Unicode text, runnable code, std::println, and comments showing what to expect.
#include "unicode_ranges_all.hpp"
#include <print>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    constexpr auto text = "é🇷🇴!"_utf8_sv;

    std::println("{}", text);                   // é🇷🇴!
    std::println("{}", text.size());            // 12 UTF-8 code units
    std::println("{}", text.char_count());      // 5 Unicode scalars
    std::println("{}", text.grapheme_count());  // 3 graphemes
    std::println("{}", text.find("!"_u8c));     // 11
    std::println("{}", text.find("🇷"_u8c));     // 3
    std::println("{}", text.chars());           // [e, ́, 🇷, 🇴, !]
    std::println("{::s}", text.graphemes());    // [é, 🇷🇴, !]
}
Info
Reading the first example:
- é is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
- That means é is one grapheme, but two scalars.
- 🇷🇴 is one grapheme built from two regional-indicator scalars.
- This is why size(), char_count(), and grapheme_count() intentionally differ.
Runtime validation¶
When text arrives at runtime as raw bytes, validate it once and keep the validated type:
#include "unicode_ranges_all.hpp"
#include <cstdio>   // stderr
#include <print>
#include <string>

using namespace unicode_ranges;

int main()
{
    std::string raw = "Grüße din România 👋";

    // Validate once; on success, keep working with the validated type.
    auto text = utf8_string::from_bytes(raw);
    if (!text)
    {
        std::println(stderr,
                     "Invalid UTF-8 at byte {}",
                     text.error().first_invalid_byte_index);
        return 1;
    }

    std::println("{}", *text);                  // Grüße din România 👋
    std::println("{}", text->char_count());     // 19 Unicode scalars
    std::println("{}", text->front().value());  // G
    std::println("{}", text->back().value());   // 👋
}
Formatting and printing¶
Library-defined UTF-8, UTF-16, and UTF-32 types support formatting and printing directly. Borrowed views such as chars() and graphemes() are easy to inspect too. For grapheme views, the examples use "{::s}" so the printed range stays visually uniform with the underlying text:
Warning
std::println("{}", text.chars()) and std::println("{::s}", text.graphemes()) rely on C++23 range-formatting support in the standard library.
- this works on the MSVC STL and on libc++
- libstdc++ 14 does not currently format these custom helper views directly
- the GCC docs-example CI job therefore treats that specific limitation as informational rather than blocking
#include "unicode_ranges_all.hpp"
#include <print>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    const utf8_string text = "mañana 👩‍💻"_utf8_s;

    std::println("{}", text);                 // mañana 👩‍💻
    std::println("{}", text.chars());         // [m, a, ñ, a, n, a,  , 👩, ‍, 💻]
    std::println("{::s}", text.graphemes());  // [m, a, ñ, a, n, a,  , 👩‍💻]
}
Views versus owning strings¶
- utf8_string_view / utf16_string_view / utf32_string_view borrow existing storage.
- utf8_string / utf16_string / utf32_string own and mutate storage.
- On lvalue text, chars(), graphemes(), char_indices(), and grapheme_indices() are borrowing range views.
- On rvalue owning strings, those members return move-only owning views so temporary strings do not dangle.
- Materializing an owning string from a same-encoding chars() or rvalue reversed_chars() view can use a direct storage path.
Do not keep borrowed ranges alive after the source storage dies or after the owning string mutates.
Counting and indexing¶
The library intentionally distinguishes:
- code units: size()
- Unicode scalar values: char_count()
- grapheme clusters: grapheme_count()
UTF-8 view/string search APIs generally return byte offsets. UTF-16 and UTF-32 view/string search APIs generally return code-unit offsets. Character-oriented APIs are named explicitly, such as char_at, is_char_boundary, and ceil_char_boundary.
Example sanity checks¶
The examples under docs/examples/ are compiled in CI so the docs do not silently drift away from the library surface. The one current exception is direct std::print formatting of helper views on GCC/libstdc++ 14, which is tracked as an informational, non-blocking docs-example failure.