Text Operations

Both borrowed and owning text types expose STL-style search APIs, including:

  • contains
  • find
  • rfind
  • find_first_of / find_first_not_of
  • find_last_of / find_last_not_of
  • starts_with
  • ends_with

These work with text, characters, predicates, and span-based character sets depending on the overload.

#include "unicode_ranges_all.hpp"

#include <print>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    constexpr auto view = "café café"_utf8_sv;
    auto owned = "été en été"_utf8_s;

    std::println("{}", view);                  // café café
    std::println("{}", view.find("é"_u8c));    // 3 (byte offset of the first é)
    std::println("{}", view.rfind("é"_u8c));   // 9 (byte offset of the last é)

    std::println("{}",
        owned.replace_all("é"_u8c, "e"_u8c)); // ete en ete
}

Split and match families

String views expose a broad split/match surface:

  • split, rsplit
  • split_terminator, rsplit_terminator
  • splitn, rsplitn
  • split_inclusive
  • split_trimmed
  • matches, match_indices, rmatches, rmatch_indices
  • split_once, rsplit_once
  • split_whitespace, split_ascii_whitespace

Delimiter behavior is intentionally explicit. See the split sections in the string view reference.

#include "unicode_ranges_all.hpp"

#include <print>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    constexpr auto line = " café | thé | apă "_utf8_sv;
    constexpr auto framed = "***café***"_utf8_sv;

    for (auto part : line.split_trimmed("|"_utf8_sv))
    {
        std::println("[{}]", part);
    }
    // [café]
    // [thé]
    // [apă]

    std::println("{}", framed.trim_matches("*"_u8c)); // café
}

The split APIs also compose cleanly with standard range pipelines:

#include "unicode_ranges_all.hpp"

#include <print>
#include <ranges>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    auto phrase = "Be the change you want to see in the world"_utf8_s;

    phrase = phrase.split_ascii_whitespace()
        | std::views::transform(&utf8_string_view::chars)
        | std::views::join_with("_"_u8c)
        | std::ranges::to<utf8_string>();

    std::println("{}", phrase); // Be_the_change_you_want_to_see_in_the_world
}

Trim and prefix/suffix operations

Available operations include:

  • strip_prefix, strip_suffix, strip_circumfix
  • trim_prefix, trim_suffix
  • trim_start_matches, trim_end_matches, trim_matches
  • trim_start, trim_end, trim
  • trim_ascii_start, trim_ascii_end, trim_ascii

The matcher-based trim APIs accept characters, text, predicates, and character sets.
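
As a rough illustration of predicate-based trimming, here is a minimal standalone sketch over `std::string_view` that uses only the standard library. It does not reflect how the library implements its matchers, and the names `trim_matches_sketch` and `is_star` are hypothetical. The sketch works on bytes, so it is only safe for ASCII matchers: an ASCII byte never occurs inside a multi-byte UTF-8 sequence.

```cpp
#include <string_view>

// Hypothetical helper: strip every leading and trailing byte that satisfies
// `pred`. Byte-based, so only meaningful for predicates that match ASCII.
template <class Pred>
constexpr std::string_view trim_matches_sketch(std::string_view s, Pred pred)
{
    while (!s.empty() && pred(s.front())) s.remove_prefix(1);
    while (!s.empty() && pred(s.back())) s.remove_suffix(1);
    return s;
}

// Hypothetical predicate used for the check below.
constexpr bool is_star(char c) { return c == '*'; }

// "\xC3\xA9" is the UTF-8 encoding of é; the accented letter is left intact.
static_assert(trim_matches_sketch("***caf\xC3\xA9***", is_star)
              == std::string_view{"caf\xC3\xA9"});
```

A string made up entirely of matching bytes trims down to an empty view.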

Boundary and access APIs

Important boundary-aware APIs include:

  • is_char_boundary
  • ceil_char_boundary
  • floor_char_boundary
  • is_grapheme_boundary
  • ceil_grapheme_boundary
  • floor_grapheme_boundary
  • char_at
  • grapheme_at
  • grapheme_substr
  • substr

These are essential whenever offsets are expressed in code units but user-visible semantics depend on characters or graphemes.

#include "unicode_ranges_all.hpp"

#include <print>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    // The "é" below is decomposed: 'e' followed by U+0301 (combining acute accent).
    constexpr auto text = "é🇷🇴!"_utf8_sv;

    std::println("{}", text.is_char_boundary(1));         // true
    std::println("{}", text.is_grapheme_boundary(1));     // false
    std::println("{}", text.ceil_grapheme_boundary(7));   // 11
    std::println("{}", text.floor_grapheme_boundary(7));  // 3

    std::println("{}", text.chars());          // [e, ́, 🇷, 🇴, !]
    std::println("{::s}", text.graphemes());   // [é, 🇷🇴, !]
}
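
The character-boundary checks follow directly from the UTF-8 encoding. Below is a minimal standalone sketch that uses only the standard library; it is not the library's implementation. The `*_sketch` names are hypothetical, the input is assumed to be valid UTF-8, and clamping out-of-range offsets to the size is a choice made for this sketch. Grapheme boundaries require the Unicode segmentation rules and are not covered here.

```cpp
#include <cstddef>
#include <string_view>

// A byte offset is a character boundary if it is 0, equal to the size, or the
// byte at that offset is not a continuation byte (bit pattern 10xxxxxx).
constexpr bool is_char_boundary_sketch(std::string_view s, std::size_t pos)
{
    if (pos == 0 || pos == s.size()) return true;
    if (pos > s.size()) return false;
    return (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
}

// Largest boundary <= pos (sketch assumption: offsets past the end clamp).
constexpr std::size_t floor_char_boundary_sketch(std::string_view s, std::size_t pos)
{
    if (pos >= s.size()) return s.size();
    while (!is_char_boundary_sketch(s, pos)) --pos; // offset 0 always stops the loop
    return pos;
}

// Smallest boundary >= pos (sketch assumption: offsets past the end clamp).
constexpr std::size_t ceil_char_boundary_sketch(std::string_view s, std::size_t pos)
{
    if (pos >= s.size()) return s.size();
    while (!is_char_boundary_sketch(s, pos)) ++pos; // offset size() always stops the loop
    return pos;
}

// "caf\xC3\xA9!" is "café!" with é encoded as the two bytes at offsets 3 and 4.
static_assert(is_char_boundary_sketch("caf\xC3\xA9!", 3));
static_assert(!is_char_boundary_sketch("caf\xC3\xA9!", 4));
static_assert(floor_char_boundary_sketch("caf\xC3\xA9!", 4) == 3);
static_assert(ceil_char_boundary_sketch("caf\xC3\xA9!", 4) == 5);
```

Rounding an arbitrary byte offset with the floor or ceil variant before slicing prevents a multi-byte character from being cut in half.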

Reverse and replace on owning strings

Owning strings add mutating APIs such as:

  • insert
  • pop_back
  • erase
  • reverse()
  • reverse(pos, count = npos)
  • reverse_graphemes()
  • reverse_graphemes(pos, count = npos)
  • replace(...)
  • replace_all(...)
  • replace_n(...)

Case-transformation APIs on owning strings also provide overloads that transform only a subrange:

  • to_ascii_lowercase(pos, count)
  • to_ascii_uppercase(pos, count)
  • to_lowercase(pos, count)
  • to_uppercase(pos, count)

#include "unicode_ranges_all.hpp"

#include <print>

using namespace unicode_ranges;
using namespace unicode_ranges::literals;

int main()
{
    // The "é" below is decomposed: 'e' followed by U+0301 (combining acute accent).
    auto chars = "é🇷🇴!"_utf8_s;
    auto graphemes = chars;

    chars.reverse();
    graphemes.reverse_graphemes();

    std::println("{}", chars);      // !🇴🇷́e
    std::println("{}", graphemes);  // !🇷🇴é
}

Return-unit semantics

The most important rule to keep in mind:

  • UTF-8 view/string search offsets are byte offsets unless the API name says otherwise.
  • UTF-16 view/string search offsets are code-unit offsets unless the API name says otherwise.

Character- and grapheme-oriented APIs are named explicitly and should be preferred when the distinction matters.
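
To make the difference between byte offsets and character positions concrete, here is a minimal standalone sketch that uses only the standard library. It is not part of the library, and the names `code_points_before` and `sample` are hypothetical. It converts a UTF-8 byte offset into the number of code points that precede it, assuming the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Count the code points that start before byte offset `byte_pos`.
// Assumes `s` is valid UTF-8; offsets past the end are treated as s.size().
constexpr std::size_t code_points_before(std::string_view s, std::size_t byte_pos)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_pos && i < s.size(); ++i)
    {
        // Every byte except a continuation byte (10xxxxxx) starts a code point.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) ++count;
    }
    return count;
}

// "été en" with each é encoded as the two bytes 0xC3 0xA9.
inline constexpr std::string_view sample = "\xC3\xA9t\xC3\xA9 en";

static_assert(sample.find('n') == 7);               // byte offset
static_assert(code_points_before(sample, 7) == 5);  // five code points precede 'n'
```

The two numbers diverge as soon as any non-ASCII character precedes the match, which is why a byte offset returned by a search should not be treated as a character index.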

Grapheme-aware operations

Default Unicode grapheme segmentation is supported through:

  • graphemes()
  • grapheme_indices()
  • grapheme_count()
  • grapheme-aware searching and substring APIs

These use default Unicode grapheme-cluster rules rather than locale-specific tailoring.