Benchmarking¶
This page defines how `unicode_ranges` will be benchmarked against other libraries.
Current implementation note:

- `unicode_ranges` now uses `simdutf` as its production runtime backend for the hot UTF validation and UTF-8 -> UTF-16/UTF-32 transcoding paths.
- That means the current comparative rows in those families measure `unicode_ranges` integration overhead, API shape, allocation behavior, and fallback decisions against raw `simdutf` public API usage; they are not a claim that `unicode_ranges` and `simdutf` are independent low-level codec implementations.
The benchmark suite is intended to answer a narrow question:
- for a specific Unicode task, with clearly defined semantics, how does `unicode_ranges` compare to the strongest available implementation on each major C++ toolchain?
It is not intended to produce a single marketing number or an "overall winner".
Goals¶
- compare `unicode_ranges` against strong existing libraries where the feature overlap is real
- keep every benchmark as close to a semantic 1:1 comparison as possible
- separate algorithm cost from container/allocation cost
- publish results per toolchain, not as one merged score
- keep the suite reproducible enough that regressions are actionable
Non-goals¶
- no aggregate "fastest Unicode library" claim
- no comparison rows where the libraries do meaningfully different work
- no hidden switching between strict failure and replacement behavior
- no mixing of lazy view creation with owned materialization in the same benchmark row
- no toolchain-specific tuning that invalidates cross-compiler comparisons
Comparison Rules¶
These rules are mandatory. If a candidate library cannot match the row semantics, that row is skipped for that library.
Match semantics first¶
The benchmark target is "same contract", not "same-looking API call".
Examples:
- strict validation and replacement-on-error are different benchmarks
- bounded output and growable append are different benchmarks
- owning-result normalization and lazy normalization view are different benchmarks
- default grapheme segmentation and locale-tailored segmentation are different benchmarks
Prefer the closest realistic public API¶
Rows should use the closest documented public API that a competent user would actually choose for the task.
That means:
- do not reject a comparison just because another library only has a near-match with slightly different edge-case behavior outside the benchmarked corpus
- do not compare against a fundamentally different API shape when that shape clearly bakes in a performance advantage unrelated to the benchmark goal
- do not use obscure internal hooks or unnatural setup code that ordinary users would not write
When exact equivalence is impossible, the row should document the remaining difference and use the most defensible public approximation.
Separate raw and convenience paths¶
Whenever possible, a benchmark family should have two tracks:
- raw or caller-provided output
- convenience or owned-result API
That keeps container growth and allocation policy from being confused with the core algorithm cost.
Keep error handling explicit¶
Every benchmark row must state which of these semantics it uses:
- strict failure
- replacement
- skip or ignore
Rows with different error behavior are not combined.
Report per toolchain¶
Results are reported separately for:
- GCC + libstdc++
- Clang + libc++
- MSVC + MSVC STL
No averages across toolchains. A trend is only considered strong if it appears on at least two toolchains.
Prefer official or primary implementations¶
Comparison baselines should come from the primary project, not from wrappers or secondary bindings, unless the wrapper is the de facto C++ interface being compared.
Candidate Libraries¶
No single library overlaps the full unicode_ranges surface. Comparisons are therefore feature-family-specific.
| Library | Best comparison families | Notes |
|---|---|---|
| simdutf | UTF validation, UTF transcoding | strongest raw UTF codec baseline; also the current unicode_ranges runtime backend for those hot paths |
| ICU | normalization, case mapping, segmentation, legacy encoding conversion | broadest feature overlap; use converter APIs for boundary encodings |
| Boost.Text | transcoding, normalization, segmentation, case mapping | broad algorithm overlap in modern C++ |
| uni-algo | conversion, normalization, case mapping, segmentation | strong safe-Unicode algorithm baseline; strict conversion and validation APIs are public in conv.h |
| utf8proc | UTF-8 normalization, case folding | useful narrow baseline for UTF-8-only Unicode algorithms |
| utfcpp | UTF-8 validation, iteration, UTF conversion | useful UTF-only C++ baseline |
| libiconv | legacy encoding conversion | important baseline once non-UTF boundary encodings expand |
Benchmark Families¶
The suite should be organized by feature family, not by library.
UTF Validation¶
Semantics:
- strict validation
- valid input rows
- invalid input rows with explicit failure
Primary comparisons:
- `unicode_ranges`
- `simdutf`
- `Boost.Text`
- `uni-algo`
- `utfcpp`
Interpretation note:
- current `unicode_ranges` rows in this family are wrapper/integration comparisons against raw `simdutf` usage, not independent codec-algorithm competitions
UTF Transcoding¶
Semantics:
- strict, validating conversion
- same source encoding and target encoding for every row
- separate owned-result and caller-buffer rows where possible
Primary comparisons:
- `unicode_ranges`
- `simdutf`
- `Boost.Text`
- `uni-algo`
- `utfcpp`
Interpretation note:
- current `unicode_ranges` rows in this family are wrapper/integration comparisons against raw `simdutf` usage for the same reason as UTF validation
Normalization¶
Semantics:
- exact normalization form per row: NFC, NFD, NFKC, NFKD
- owned materialization rows separate from any lazy/pipeline rows
Primary comparisons:
- `unicode_ranges`
- `ICU`
- `Boost.Text`
- `uni-algo`
- `utf8proc`
Case Mapping and Case Folding¶
Semantics:
- ASCII-only rows and full Unicode rows kept separate
- lowercasing, uppercasing, and case folding kept separate
- locale-independent rows only, unless a row is explicitly about locale-sensitive behavior
Primary comparisons:
- `unicode_ranges`
- `ICU`
- `Boost.Text`
- `uni-algo`
- `utf8proc` for case folding and UTF-8 mapping rows
Grapheme and Word Segmentation¶
Semantics:
- default Unicode segmentation only
- counting rows separate from materialization or iteration rows
Primary comparisons:
- `unicode_ranges`
- `ICU`
- `Boost.Text`
- `uni-algo`
Boundary Encodings¶
Semantics:
- same source and target encoding pair per row
- strict failure rows separate from replacement rows
- bounded sink rows separate from growable output rows
Primary comparisons:
- `unicode_ranges`
- `ICU` converter APIs
- `libiconv`
Initial built-in rows should include:
- `ascii_strict`
- `ascii_lossy`
- `iso_8859_1`
- `iso_8859_15`
- `windows_1251`
- `windows_1252`
Future rows should include:
- `shift_jis`
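An iconv-based baseline for one of these single-byte rows could look like the sketch below, assuming a POSIX `iconv` implementation that knows the `WINDOWS-1252` name (glibc and GNU libiconv both do). The strict-failure contract maps naturally onto `iconv()` returning `(size_t)-1` on an unconvertible byte instead of substituting a replacement. The helper name is hypothetical:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

#include <iconv.h>

// Strict windows_1252 -> UTF-8 row body against the POSIX iconv API.
bool cp1252_to_utf8_strict(std::string_view in, std::string& out) {
    iconv_t cd = iconv_open("UTF-8", "WINDOWS-1252");
    if (cd == reinterpret_cast<iconv_t>(-1)) return false;
    // Worst case for single-byte input: every byte expands to <= 3 UTF-8
    // bytes (e.g. the euro sign), so 4x is a safe bound.
    std::string buf(in.size() * 4, '\0');
    char* src = const_cast<char*>(in.data());
    std::size_t src_left = in.size();
    char* dst = buf.data();
    std::size_t dst_left = buf.size();
    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) return false;  // strict failure
    out.assign(buf.data(), buf.size() - dst_left);
    return true;
}
```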
Corpus Policy¶
Synthetic microbenchmarks are useful, but not enough. Each family should use multiple corpora.
Minimum corpus set:
- ASCII-heavy text
- mixed Western European UTF text
- combining-mark-heavy text
- emoji-heavy text
- Cyrillic or other non-Latin script text
- malformed UTF for strict-validation and replacement rows
- medium-sized payloads
- large payloads
Each corpus must be shared across libraries for that row.
Measurement Policy¶
- use the same benchmark harness shape across all rows
- keep warm-up and sample policy explicit
- report `ns/op`, throughput, and iteration count
- report allocation-sensitive rows separately when allocation is part of the benchmarked contract
- never hide failed rows; if a library cannot express the required semantics, mark the row unsupported
- unsupported rows should still appear in the suite output with a short reason instead of silently disappearing
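The harness shape these bullets describe can be reduced to a small sketch: an explicit warm-up that is never reported, a fixed iteration budget, and `ns/op` derived from total time and iterations. Names and default numbers are illustrative, not the real `tools/comparative_benchmarks` code:

```cpp
#include <chrono>
#include <cstdint>
#include <functional>

struct Measurement {
    double ns_per_op;
    std::uint64_t iterations;
};

// One sample of one row: warm up, then time a fixed number of iterations.
Measurement measure(const std::function<void()>& row,
                    int warmup = 100,
                    std::uint64_t iterations = 10'000) {
    for (int i = 0; i < warmup; ++i) row();  // warm-up, never reported
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) row();
    auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - start);
    return {static_cast<double>(elapsed.count()) / iterations, iterations};
}
```

In a real suite the sample policy (sample count, outlier handling) would sit one level above this function, and unsupported rows would be emitted as records with a reason string rather than being skipped.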
Result Interpretation¶
When discussing results:
- compare within one benchmark family first
- compare within one toolchain first
- call out cases where destination/container choice dominates the result
- avoid broad conclusions from a single compiler or one noisy runner
This matters especially for:
- ranges-heavy code
- iterator-heavy code
- growable container output paths
- standard-library-dependent behavior
Planned Implementation Phases¶
Phase 1¶
- benchmark charter and reporting policy
- benchmark project layout
- corpus layout
- toolchain matrix in CI
Phase 2¶
- UTF validation and UTF transcoding comparisons
- initial baselines: `simdutf`, `utfcpp`, `uni-algo`
Phase 3¶
- normalization and case mapping comparisons
- initial baselines: `ICU`, `Boost.Text`, `uni-algo`, `utf8proc`
Phase 4¶
- grapheme and word segmentation comparisons
- initial baselines: `ICU`, `Boost.Text`, `uni-algo`
Phase 5¶
- boundary encoding comparisons
- initial baselines: `ICU` converters and `libiconv`
- start with currently built-in single-byte codecs
- extend to `shift_jis` after native support lands
Current Status¶
This page started as the design charter and now also reflects the initial scaffold on the `feature/comparative-benchmarks` branch.
Current comparative suite:
- a dedicated comparative benchmark runner: `tools/comparative_benchmarks/main.cpp`
- a shared benchmark model and harness under `tools/comparative_benchmarks/`
- initial corpus layout for UTF-8 validation and UTF-8 transcoding rows
- initial `unicode_ranges` baseline adapters for strict UTF-8 validation and strict UTF-8 owned transcoding
- initial third-party baselines:
    - `simdutf`
        - pinned to upstream `v7.7.0`
        - vendored in the repository under `third_party/simdutf` for the shipped runtime backend
        - the comparative CI may still fetch an explicit baseline copy when exercising the standalone `simdutf` row
        - wired for strict UTF-8 validation and strict UTF-8 transcoding
    - `utfcpp`
        - pinned to tag `v4.0.9`
        - fetched dynamically in CI through a shallow tag clone
        - wired for strict UTF-8 validation and strict UTF-8 transcoding
    - `uni-algo`
        - pinned to tag `v1.0.0`
        - fetched dynamically in CI through a shallow tag clone
        - wired for strict UTF-8 validation and strict UTF-8 owned transcoding
        - reported as unsupported for current caller-buffer rows because its public conversion API materializes owned strings
- strict UTF-8 caller-buffer transcoding rows are present too
    - `simdutf` and `utfcpp` are currently the supported external baselines there
    - `uni-algo` is reported as unsupported there because the public API does not expose caller-buffer UTF transcoding
    - `unicode_ranges` is reported as unsupported for those rows because it does not currently expose a public caller-buffer UTF transcoding API
- comparative dependencies are defined in `tools/comparative_benchmarks/dependencies.json` and fetched through `tools/fetch_comparative_dependency.ps1`
- a manifest-driven dependency fetch script for external comparative baselines
- CI jobs that fetch, build, and run the comparative suite on GCC, Clang, and MSVC
Important current caveat:

- because `unicode_ranges` now uses `simdutf` as the production runtime backend for UTF validation and UTF-8 -> UTF-16/UTF-32 transcoding, those comparative families should be read primarily as:
    - wrapper overhead comparisons
    - API-shape and allocation-model comparisons
    - fallback-policy comparisons
- they should not be read as "completely unrelated low-level algorithm A versus algorithm B"
The current status still does not imply:

- vendored third-party dependencies beyond the `simdutf` runtime backend
- broad cross-library coverage beyond the initial `simdutf`, `utfcpp`, and `uni-algo` baselines
- benchmark rows for normalization, case mapping, segmentation, or boundary encodings
The next implementation phases on this branch are additional external baselines and additional benchmark families.