Benchmarking¶

This page defines how unicode_ranges will be benchmarked against other libraries.

Current implementation note:

unicode_ranges now uses simdutf as its production runtime backend for the hot UTF validation and UTF-8 -> UTF-16/UTF-32 transcoding paths.
That means current comparative rows in those families measure unicode_ranges integration overhead, API shape, allocation behavior, and fallback decisions against raw simdutf public API usage; they are not a claim that unicode_ranges and simdutf are independent low-level codec implementations.

The benchmark suite is intended to answer a narrow question:

for a specific Unicode task, with clearly defined semantics, how does unicode_ranges compare to the strongest available implementation on each major C++ toolchain?

It is not intended to produce a single marketing number or an "overall winner".

Goals¶

compare unicode_ranges against strong existing libraries where the feature overlap is real
keep every benchmark as close to a semantic 1:1 comparison as possible
separate algorithm cost from container/allocation cost
publish results per toolchain, not as one merged score
keep the suite reproducible enough that regressions are actionable

Non-goals¶

no aggregate "fastest Unicode library" claim
no comparison rows where the libraries do meaningfully different work
no hidden switching between strict failure and replacement behavior
no mixing of lazy view creation with owned materialization in the same benchmark row
no toolchain-specific tuning that invalidates cross-compiler comparisons

Comparison Rules¶

These rules are mandatory. If a candidate library cannot match the row semantics, that row is skipped for that library.

Match semantics first¶

The benchmark target is "same contract", not "same-looking API call".

Examples:

strict validation and replacement-on-error are different benchmarks
bounded output and growable append are different benchmarks
owning-result normalization and lazy normalization view are different benchmarks
default grapheme segmentation and locale-tailored segmentation are different benchmarks

Prefer the closest realistic public API¶

Rows should use the closest documented public API that a competent user would actually choose for the task.

That means:

do not reject a comparison just because another library only has a near-match with slightly different edge-case behavior outside the benchmarked corpus
do not compare against a fundamentally different API shape when that shape clearly bakes in a performance advantage unrelated to the benchmark goal
do not use obscure internal hooks or unnatural setup code that ordinary users would not write

When exact equivalence is impossible, the row should document the remaining difference and use the most defensible public approximation.

Separate raw and convenience paths¶

Whenever possible, a benchmark family should have two tracks:

raw or caller-provided output
convenience or owned-result API

That keeps container growth and allocation policy from being confused with the core algorithm cost.

Keep error handling explicit¶

Every benchmark row must state which of these semantics it uses:

strict failure
replacement
skip or ignore

Rows with different error behavior are not combined.

Report per toolchain¶

Results are reported separately for:

GCC + libstdc++
Clang + libc++
MSVC + MSVC STL

No averages across toolchains. A trend is only considered strong if it appears on at least two toolchains.

Prefer official or primary implementations¶

Comparison baselines should come from the primary project, not from wrappers or secondary bindings, unless the wrapper is the de facto C++ interface being compared.

Candidate Libraries¶

No single library overlaps the full unicode_ranges surface. Comparisons are therefore feature-family-specific.

Library	Best comparison families	Notes
simdutf	UTF validation, UTF transcoding	strongest raw UTF codec baseline; also the current `unicode_ranges` runtime backend for those hot paths
ICU	normalization, case mapping, segmentation, legacy encoding conversion	broadest feature overlap; use converter APIs for boundary encodings
Boost.Text	transcoding, normalization, segmentation, case mapping	broad algorithm overlap in modern C++
uni-algo	conversion, normalization, case mapping, segmentation	strong safe-Unicode algorithm baseline; strict conversion and validation APIs are public in `conv.h`
utf8proc	UTF-8 normalization, case folding	useful narrow baseline for UTF-8-only Unicode algorithms
utfcpp	UTF-8 validation, iteration, UTF conversion	useful UTF-only C++ baseline
libiconv	legacy encoding conversion	important baseline once non-UTF boundary encodings expand

Benchmark Families¶

The suite should be organized by feature family, not by library.

UTF Validation¶

Semantics:

strict validation
valid input rows
invalid input rows with explicit failure

Primary comparisons:

unicode_ranges
simdutf
Boost.Text
uni-algo
utfcpp

Interpretation note:

current unicode_ranges rows in this family are wrapper/integration comparisons against raw simdutf usage, not independent codec-algorithm competitions

UTF Transcoding¶

Semantics:

strict, validating conversion
same source encoding and target encoding for every row
separate owned-result and caller-buffer rows where possible

Primary comparisons:

unicode_ranges
simdutf
Boost.Text
uni-algo
utfcpp

Interpretation note:

current unicode_ranges rows in this family are wrapper/integration comparisons against raw simdutf usage for the same reason as UTF validation

Normalization¶

Semantics:

exact normalization form per row: NFC, NFD, NFKC, NFKD
owned materialization rows separate from any lazy/pipeline rows

Primary comparisons:

unicode_ranges
ICU
Boost.Text
uni-algo
utf8proc

Case Mapping and Case Folding¶

Semantics:

ASCII-only rows and full Unicode rows kept separate
lowercasing, uppercasing, and case folding kept separate
locale-independent rows only, unless a row is explicitly about locale-sensitive behavior

Primary comparisons:

unicode_ranges
ICU
Boost.Text
uni-algo
utf8proc for case folding and UTF-8 mapping rows

Grapheme and Word Segmentation¶

Semantics:

default Unicode segmentation only
counting rows separate from materialization or iteration rows

Primary comparisons:

unicode_ranges
ICU
Boost.Text
uni-algo

Boundary Encodings¶

Semantics:

same source and target encoding pair per row
strict failure rows separate from replacement rows
bounded sink rows separate from growable output rows

Primary comparisons:

unicode_ranges
ICU converter APIs
libiconv

Initial built-in rows should include:

ascii_strict
ascii_lossy
iso_8859_1
iso_8859_15
windows_1251
windows_1252

Future rows should include:

shift_jis

Corpus Policy¶

Synthetic microbenchmarks are useful, but not enough. Each family should use multiple corpora.

Minimum corpus set:

ASCII-heavy text
mixed Western European UTF text
combining-mark-heavy text
emoji-heavy text
Cyrillic or other non-Latin script text
malformed UTF for strict-validation and replacement rows
medium-sized payloads
large payloads

Each corpus must be shared across libraries for that row.

Measurement Policy¶

use the same benchmark harness shape across all rows
keep warm-up and sample policy explicit
report ns/op, throughput, and iteration count
report allocation-sensitive rows separately when allocation is part of the benchmarked contract
never hide failed rows; if a library cannot express the required semantics, mark the row unsupported
unsupported rows should still appear in the suite output with a short reason instead of silently disappearing

Result Interpretation¶

When discussing results:

compare within one benchmark family first
compare within one toolchain first
call out cases where destination/container choice dominates the result
avoid broad conclusions from a single compiler or one noisy runner

This matters especially for:

ranges-heavy code
iterator-heavy code
growable container output paths
standard-library-dependent behavior

Planned Implementation Phases¶

Phase 1¶

benchmark charter and reporting policy
benchmark project layout
corpus layout
toolchain matrix in CI

Phase 2¶

UTF validation and UTF transcoding comparisons
initial baselines: simdutf, utfcpp, uni-algo

Phase 3¶

normalization and case mapping comparisons
initial baselines: ICU, Boost.Text, uni-algo, utf8proc

Phase 4¶

grapheme and word segmentation comparisons
initial baselines: ICU, Boost.Text, uni-algo

Phase 5¶

boundary encoding comparisons
initial baselines: ICU converters and libiconv
start with currently built-in single-byte codecs
extend to shift_jis after native support lands

Current Status¶

This page started as the design charter and now also reflects the initial scaffold on the feature/comparative-benchmarks branch.

Current comparative suite:

a dedicated comparative benchmark runner: tools/comparative_benchmarks/main.cpp
a shared benchmark model and harness under tools/comparative_benchmarks/
initial corpus layout for UTF-8 validation and UTF-8 transcoding rows
initial unicode_ranges baseline adapters for strict UTF-8 validation and strict UTF-8 owned transcoding
initial third-party baselines:
simdutf
- pinned to upstream v7.7.0
- vendored in the repository under third_party/simdutf for the shipped runtime backend
- the comparative CI may still fetch an explicit baseline copy when exercising the standalone simdutf row
- wired for strict UTF-8 validation and strict UTF-8 transcoding
utfcpp
- pinned to tag v4.0.9
- fetched dynamically in CI through a shallow tag clone
- wired for strict UTF-8 validation and strict UTF-8 transcoding
uni-algo
- pinned to tag v1.0.0
- fetched dynamically in CI through a shallow tag clone
- wired for strict UTF-8 validation and strict UTF-8 owned transcoding
- reported as unsupported for current caller-buffer rows because its public conversion API materializes owned strings
strict UTF-8 caller-buffer transcoding rows are present too
simdutf and utfcpp are currently the supported external baselines there
uni-algo is reported as unsupported there because the public API does not expose caller-buffer UTF transcoding
unicode_ranges is reported as unsupported for those rows because it does not currently expose a public caller-buffer UTF transcoding API
comparative dependencies are defined in tools/comparative_benchmarks/dependencies.json and fetched through tools/fetch_comparative_dependency.ps1
a manifest-driven dependency fetch script for external comparative baselines
CI jobs that fetch, build, and run the comparative suite on GCC, Clang, and MSVC

Important current caveat:

because unicode_ranges now uses simdutf as the production runtime backend for UTF validation and UTF-8 -> UTF-16/UTF-32 transcoding, those comparative families should be read primarily as:
wrapper overhead comparisons
API-shape and allocation-model comparisons
fallback-policy comparisons rather than as "completely unrelated low-level algorithm A versus algorithm B"

It still does not imply:

vendored third-party dependencies
broad cross-library coverage beyond the initial simdutf, utfcpp, and uni-algo baselines
benchmark rows for normalization, case mapping, segmentation, or boundary encodings

The next implementation phases on this branch are additional external baselines and additional benchmark families.