Text Operations

Search

Both borrowed and owning text types expose STL-style search APIs, including:

contains
find
rfind
find_first_of / find_first_not_of
find_last_of / find_last_not_of
starts_with
ends_with

Depending on the overload, the needle can be text, a single character, a predicate, or a span-based character set.

#include "unicode_ranges_all.hpp"

#include <print>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    constexpr auto view = "café café"_utf8_sv;
    auto owned = "été en été"_utf8_s;

    std::println("{}", view);                  // café café
    std::println("{}", view.find("é"_u8c));    // 3
    std::println("{}", view.rfind("é"_u8c));   // 9

    std::println("{}",
        owned.replace_all("é"_u8c, "e"_u8c)); // ete en ete
}

Split and match families

String views expose a broad split/match surface:

split, rsplit
split_terminator, rsplit_terminator
splitn, rsplitn
split_inclusive
split_trimmed
matches, match_indices, rmatches, rmatch_indices
split_once, rsplit_once
split_whitespace, split_ascii_whitespace

Delimiter behavior is intentionally explicit. See the split sections in the string view reference.

#include "unicode_ranges_all.hpp"

#include <print>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    constexpr auto line = " café | thé | apă "_utf8_sv;
    constexpr auto framed = "***café***"_utf8_sv;

    for (auto part : line.split_trimmed("|"_utf8_sv))
    {
        std::println("[{}]", part);
    }
    // [café]
    // [thé]
    // [apă]

    std::println("{}", framed.trim_matches("*"_u8c)); // café
}

The split APIs also compose cleanly with standard range pipelines:

#include "unicode_ranges_all.hpp"

#include <print>
#include <ranges>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    auto phrase = "Be the change you want to see in the world"_utf8_s;

    phrase = phrase.split_ascii_whitespace()
        | std::views::transform(&utf8_string_view::chars)
        | std::views::join_with("_"_u8c)
        | std::ranges::to<utf8_string>();

    std::println("{}", phrase); // Be_the_change_you_want_to_see_in_the_world
}

Trim and prefix/suffix operations

Available operations include:

trim_prefix, trim_suffix
trim_start_matches, trim_end_matches, trim_matches
trim_whitespace_start, trim_whitespace_end, trim_whitespace
trim_ascii_whitespace_start, trim_ascii_whitespace_end, trim_ascii_whitespace

The matcher-based trim APIs accept characters, text, predicates, and character sets.

Return ownership follows the receiver:

On utf*_string_view, trim/substr APIs return borrowed views into the original storage.
On utf*_string lvalues, trim/substr APIs return owning strings so the result cannot dangle.
On utf*_string rvalues, trim/substr APIs return owning strings and consume the source; they trim or slice the existing buffer in place for common bound-adjustment cases.
One-shot APIs that return borrowed subviews, such as split_once, rsplit_once, split_once_at, and grapheme_at, are intentionally deleted for owning rvalues.

#include "unicode_ranges_all.hpp"

#include <cassert>
#include <ranges>
#include <utility>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    utf8_string text = u8"  café  "_utf8_s;

    // Lvalue receiver: the result owns its storage and the source is unchanged.
    auto copied = text.trim_whitespace();
    assert(copied == u8"café"_utf8_sv);
    assert(text == u8"  café  "_utf8_sv);

    // Rvalue receiver: the source is consumed and may be trimmed in place.
    auto disposable = u8"  café  "_utf8_s;
    auto trimmed = std::move(disposable).trim_whitespace();
    assert(trimmed == u8"café"_utf8_sv);

    // Each call returns an owning rvalue, so trims can be chained.
    auto framed = u8"<<<payload>>>"_utf8_s;
    auto payload = std::move(framed).trim_prefix(u8"<<<"_utf8_sv).trim_suffix(u8">>>"_utf8_sv);
    assert(payload == u8"payload"_utf8_sv);

    // split_once returns borrowed subviews, so it is called on an lvalue.
    utf8_string key_value = u8"name=value"_utf8_s;
    auto split = key_value.split_once(u8"="_u8c);
    assert(split.has_value());
    assert(split.left() == u8"name"_utf8_sv);
    assert(split.right() == u8"value"_utf8_sv);
    assert(std::ranges::size(split) == 2);
}
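
The receiver-based ownership rules above map naturally onto C++ ref-qualified member functions. The following is a minimal sketch of that pattern in standard C++, not the library's implementation: `toy_string`, `trim_spaces`, `first_word`, and `str` are hypothetical names invented for illustration, and the sketch only handles ASCII spaces.

```cpp
#include <string>
#include <string_view>
#include <utility>

// Hypothetical toy type used only to illustrate ref-qualified overloads.
class toy_string
{
public:
    explicit toy_string(std::string s) : data_(std::move(s)) {}

    // Lvalue receiver: copy the trimmed range so the result owns its storage.
    toy_string trim_spaces() const&
    {
        return toy_string(std::string(trimmed_view()));
    }

    // Rvalue receiver: adjust the bounds of the existing buffer and move it out.
    toy_string trim_spaces() &&
    {
        const auto last = data_.find_last_not_of(' ');
        data_.erase(last == std::string::npos ? 0 : last + 1); // drop trailing spaces
        data_.erase(0, data_.find_first_not_of(' '));          // drop leading spaces
        return toy_string(std::move(data_));
    }

    // Borrowed result: fine on lvalues, deleted on rvalues because the view
    // would point into a temporary that is about to be destroyed.
    std::string_view first_word() const&
    {
        const std::string_view v = trimmed_view();
        return v.substr(0, v.find(' '));
    }
    std::string_view first_word() && = delete;

    const std::string& str() const { return data_; }

private:
    std::string_view trimmed_view() const
    {
        const std::string_view v = data_;
        const auto first = v.find_first_not_of(' ');
        if (first == std::string_view::npos) return {};
        const auto last = v.find_last_not_of(' ');
        return v.substr(first, last - first + 1);
    }

    std::string data_;
};
```

With this shape, `toy_string("  a b ").first_word()` fails to compile, while the same call on a named object returns a view into that object.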

Boundary and access APIs

Important boundary-aware APIs include:

is_char_boundary
ceil_char_boundary
floor_char_boundary
is_grapheme_boundary
ceil_grapheme_boundary
floor_grapheme_boundary
char_at
grapheme_at
grapheme_substr
substr

These are essential whenever offsets are expressed in code units but user-visible semantics depend on characters or graphemes.

#include "unicode_ranges_all.hpp"

#include <print>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    constexpr auto text = "é🇷🇴!"_utf8_sv;

    std::println("{}", text.is_char_boundary(1));         // true
    std::println("{}", text.is_grapheme_boundary(1));     // false
    std::println("{}", text.ceil_grapheme_boundary(7));   // 11
    std::println("{}", text.floor_grapheme_boundary(7));  // 3

    std::println("{}", text.chars());          // [e, ́, 🇷, 🇴, !]
    std::println("{::s}", text.graphemes());   // [é, 🇷🇴, !]
}
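
For UTF-8, character boundaries can be derived from the bytes alone, because continuation bytes always have the bit pattern 10xxxxxx. The sketch below shows that idea in standard C++, assuming well-formed UTF-8 input. The `utf8_*` free functions are hypothetical stand-ins rather than the library's API, and grapheme boundaries are not covered because they require the Unicode segmentation rules.

```cpp
#include <cstddef>
#include <string_view>

// A UTF-8 continuation byte has the form 10xxxxxx.
constexpr bool utf8_is_continuation(char b)
{
    return (static_cast<unsigned char>(b) & 0xC0) == 0x80;
}

// An offset is a character boundary if it is the end of the text
// or does not point at a continuation byte.
constexpr bool utf8_is_boundary(std::string_view s, std::size_t i)
{
    if (i >= s.size()) return i == s.size();
    return !utf8_is_continuation(s[i]);
}

// Largest boundary that is <= i (offsets past the end clamp to size()).
constexpr std::size_t utf8_floor_boundary(std::string_view s, std::size_t i)
{
    if (i >= s.size()) return s.size();
    while (i > 0 && utf8_is_continuation(s[i])) --i;
    return i;
}

// Smallest boundary that is >= i (offsets past the end clamp to size()).
constexpr std::size_t utf8_ceil_boundary(std::string_view s, std::size_t i)
{
    if (i >= s.size()) return s.size();
    while (i < s.size() && utf8_is_continuation(s[i])) ++i;
    return i;
}
```

For the 12-byte text used above (e, U+0301, two regional indicators, !), byte offset 5 falls inside the first regional indicator, so this sketch floors it to 3 and ceils it to 7.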

Reverse and replace on owning strings

Owning strings (utf8_string, utf16_string, and utf32_string) add mutating APIs such as:

insert
pop_back
erase
reverse()
reverse(pos, count = npos)
reverse_graphemes()
reverse_graphemes(pos, count = npos)
replace(...)
replace_all(...)
replace_n(...)

Case-transformation APIs also provide partial-range overloads on owning strings, taking a position and a count:

to_ascii_lowercase(pos, count)
to_ascii_uppercase(pos, count)
to_lowercase(pos, count)
to_uppercase(pos, count)

#include "unicode_ranges_all.hpp"

#include <print>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    auto chars = "é🇷🇴!"_utf8_s;
    auto graphemes = chars;

    chars.reverse();
    graphemes.reverse_graphemes();

    std::println("{}", chars);      // !🇴🇷́e
    std::println("{}", graphemes);  // !🇷🇴é
}

reverse_graphemes() is an owning-string mutator. It is not available on string views, and there is no lazy reversed_graphemes() view.

A reverse-grapheme view would have to discover grapheme boundaries in forward order, store the resulting slices, and then iterate that storage backwards. A hidden allocation like that is too surprising for an operation that only creates a view. When you need this behavior, materialize graphemes() into a container and reverse the container.
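
The materialize-then-reverse approach can be sketched in standard C++ as follows. This is an illustration under assumptions, not library code: `join_reversed` is a hypothetical helper, and its `slices` parameter stands in for the output of a forward grapheme pass, assumed here to be a range of elements convertible to `std::string_view`.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: stores the slices produced by a forward pass,
// reverses the stored sequence, and concatenates the result.
template <typename Range>
std::string join_reversed(const Range& slices)
{
    // The allocation is explicit here instead of hidden inside a view.
    std::vector<std::string_view> stored(std::begin(slices), std::end(slices));
    std::reverse(stored.begin(), stored.end());

    std::string out;
    for (const std::string_view slice : stored)
    {
        out.append(slice);
    }
    return out;
}
```

Each slice is kept intact, so a multi-code-point cluster such as e followed by U+0301 stays in its original internal order while the clusters themselves are reversed.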

Return-unit semantics

The most important rules to keep in mind:

UTF-8 view/string search offsets are byte offsets unless the API name says otherwise.
UTF-16 view/string search offsets are code-unit offsets unless the API name says otherwise.

Use the explicitly named character- and grapheme-oriented APIs when the distinction matters.
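
To make the difference between units concrete, the sketch below uses only standard `std::string_view` and `std::u16string_view` (not the library's types) to locate the same character in UTF-8 and UTF-16 encodings of the same text. `code_point_index` is a hypothetical helper that converts a UTF-8 byte offset into a code point index, assuming well-formed input.

```cpp
#include <cstddef>
#include <string_view>

// "café café" with precomposed U+00E9, written with escapes so the
// encoding of this source file does not matter.
constexpr std::string_view    utf8_text  = "caf\xC3\xA9 caf\xC3\xA9";
constexpr std::u16string_view utf16_text = u"caf\u00E9 caf\u00E9";

// U+1F1F7 followed by '!': 4 bytes in UTF-8, a surrogate pair in UTF-16.
constexpr std::string_view    flag_utf8  = "\xF0\x9F\x87\xB7!";
constexpr std::u16string_view flag_utf16 = u"\U0001F1F7!";

// Hypothetical helper: number of code points that start before byte_offset.
constexpr std::size_t code_point_index(std::string_view utf8, std::size_t byte_offset)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset && i < utf8.size(); ++i)
    {
        // Every byte that is not a continuation byte starts a code point.
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) ++count;
    }
    return count;
}

// utf8_text.rfind("\xC3\xA9")       -> 9 (byte offset)
// utf16_text.rfind(u'\u00E9')       -> 8 (code-unit offset)
// code_point_index(utf8_text, 9)    -> 8 (code point index)
// flag_utf8.find('!')               -> 4
// flag_utf16.find(u'!')             -> 2
```

The last é sits at three different positions depending on the unit being counted, which is why an offset should only be passed back to an API that uses the same unit.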

Grapheme-aware operations

Default Unicode grapheme segmentation is supported through:

graphemes()
grapheme_indices()
grapheme_count()
grapheme-aware searching and substring APIs

These use default Unicode grapheme-cluster rules rather than locale-specific tailoring.