External Benchmarking
Methodology and operating rules for serious external benchmark comparisons.
ferrocat-bench exposes a manual, reproducible comparison suite for external baselines.
The internal microbenchmarks remain the fast day-to-day performance loop. The external comparison suite is for serious cross-runtime checkpoints on a documented reference host, especially when comparing Ferrocat against GNU gettext, Node, and Python ecosystem tools.
Choosing Comparison Targets
Comparison targets are kept to libraries that are both widely used and actively maintained -- the tools real projects actually depend on. Niche or single-author experiments are deliberately left out: a fast number against an unmaintained parser is not a meaningful baseline.
Because Ferrocat ships into JS/TS frontends through Palamedes, the most relevant
ICU MessageFormat baselines are the established JavaScript parsers
(@formatjs/icu-messageformat-parser, @messageformat/parser), not Rust-native
ICU crates. The Rust ecosystem has no widely adopted MessageFormat-1 parser, and
ICU4X targets MessageFormat 2.0 only, so there is no fair same-language ICU peer.
The same holds for PO itself: there is no widely adopted Rust PO library to
compare against. The one Rust PO crate (polib) has effectively no ecosystem
adoption (single author, ~14 stars, no reverse dependencies on crates.io), so it
is left out for the same reason as niche ICU parsers. PO comparisons therefore
run against the established JS, Python, and PHP tools (pofile,
gettext-parser, Python polib, and PHP gettext/gettext). For catalog
updates, gettext/gettext's mergeWith, babel's Catalog.update, and GNU
msgmerge are the maintained tools with real merge semantics.
ICU compile-to-runtime (@messageformat/core, @formatjs/cli, Lingui) is a
separate axis and is intentionally not wired up yet: those tools precompile ICU
to ASTs/functions, whereas Ferrocat's compile_catalog_artifact resolves
catalogs to runtime message strings and leaves ICU parsing to the client. A
direct compile-vs-compile benchmark would compare different operations.
Comparison Targets At A Glance
Every external tool the suite measures against, with the pinned version captured
in report metadata. pofile-ts is our own performance-focused fork of pofile,
so the fastest JavaScript PO parser in the suite is one we maintain ourselves --
the comparison deliberately uses the harder JS target rather than the slower
original.
| Tool | Runtime | Role in the suite | Pinned version |
|---|---|---|---|
| pofile-ts (our fork) | Node | PO parse, serialize, merge | 4.0.3 |
| pofile | Node | PO parse, serialize, merge | 1.1.4 |
| gettext-parser | Node | PO parse, serialize | 9.0.2 |
| polib | Python | PO parse, serialize, merge | 1.2.0 |
| Babel | Python | catalog update | 2.18.0 |
| gettext/gettext | PHP | PO parse, serialize | 5.7 |
| GNU gettext (msgmerge, msgcat) | C | merge, concat | system |
| @formatjs/icu-messageformat-parser | Node | ICU parse | 3.5.11 |
| @messageformat/parser | Node | ICU parse | 5.1.1 |
Scheduled Rust Reports
The Benchmark Reports GitHub Actions workflow runs every Monday and can also
be started manually. It uses the Rust-only rust-scheduled-v1 profile by
default, writes a JSON report under target/benchmark/, and uploads that report
as a workflow artifact.
Scheduled hosted-runner reports are trend visibility and noise detection, not
publication-grade benchmarks. Pull request runs compare the PR report against a
fresh baseline report from the PR base SHA and fail only when matching,
non-noisy scenarios exceed a 20% median elapsed regression. Scenarios marked
noisy by coefficient of variation or relative span are reported in the summary
but do not fail the workflow.
The comparison command is also available locally:
cargo run --release -p ferrocat-bench -- regression-check \
--baseline target/benchmark/rust-scheduled-v1-baseline.json \
--current target/benchmark/rust-scheduled-v1-current.json \
--max-regression-percent 20
Regression Budgets
The rust-scheduled-v1 profile is the PR-visible regression guard. Its budget
is intentionally relative to a same-run baseline rather than a fixed hosted
runner throughput number:
- fail matching, non-noisy scenarios above 20% median elapsed regression
- skip noisy scenarios and surface them in the job summary
- treat scheduled reports as trend artifacts unless a maintainer promotes one into release notes or a reference-host checkpoint
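The budget rules above can be sketched in a few lines of Python. This is an illustration of the rule as described here, not the ferrocat-bench implementation: the noise thresholds (MAX_CV, MAX_RELATIVE_SPAN), the function names, and the input shape (scenario name mapped to elapsed samples in nanoseconds) are all assumptions.

```python
from statistics import mean, median, pstdev

# Hypothetical noise thresholds; the values ferrocat-bench uses are not stated here.
MAX_CV = 0.10
MAX_RELATIVE_SPAN = 0.25


def is_noisy(samples_ns):
    """Flag a scenario whose samples vary too much to trust its median."""
    cv = pstdev(samples_ns) / mean(samples_ns)  # coefficient of variation
    span = (max(samples_ns) - min(samples_ns)) / median(samples_ns)  # relative span
    return cv > MAX_CV or span > MAX_RELATIVE_SPAN


def regression_check(baseline, current, max_regression_percent=20.0):
    """Return (failures, skipped_noisy) for scenarios present in both reports.

    `baseline` and `current` map a scenario name to its elapsed samples in ns
    (an assumed shape, not the real report schema).
    """
    failures, skipped = [], []
    for name in sorted(baseline.keys() & current.keys()):  # matching scenarios only
        base, cur = baseline[name], current[name]
        if is_noisy(base) or is_noisy(cur):
            skipped.append(name)  # surfaced in the summary, never fails the run
            continue
        change = (median(cur) - median(base)) / median(base) * 100.0
        if change > max_regression_percent:
            failures.append((name, round(change, 1)))
    return failures, skipped
```

A scenario that exists in only one report is ignored, which mirrors the "matching" wording: the guard only judges pairs it can compare directly.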
The slowest Rust-only path currently under review is
po-update/catalog-icu-heavy/ferrocat. It gets no looser budget than the same
20% PR regression band. If it trips the guard or a same-host local run drops
below roughly 40 MiB/s, profile the path before changing the threshold. The
first profiling pass should separate repeated ICU parsing/projection, plural
category resolution, allocation/cloning, serialization, and catalog
normalization costs.
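To check a local run against the 40 MiB/s floor, the median elapsed time has to be turned into throughput. The sketch below assumes throughput means fixture input bytes divided by median elapsed time; how ferrocat-bench derives its own MiB/s figure is not stated here, and both function names are hypothetical.

```python
MIB = 1024 * 1024


def throughput_mib_per_s(bytes_processed, elapsed_ns):
    """Convert a byte count and an elapsed time in nanoseconds to MiB/s."""
    return (bytes_processed / MIB) / (elapsed_ns / 1e9)


def below_floor(bytes_processed, median_elapsed_ns, floor_mib_per_s=40.0):
    """True when a run is slower than the floor and deserves a profiling pass."""
    return throughput_mib_per_s(bytes_processed, median_elapsed_ns) < floor_mib_per_s
```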
Reproduce A Fast Baseline
For a quick local signal before and after hot-path changes, use the quick official profile:
cargo run --release -p ferrocat-bench -- compare gettext-official-quick-v1 --out target/benchmark/gettext-official-quick-v1.json
Keep the generated JSON report with your notes when a change is performance-motivated. For numbers you intend to publish or compare publicly, use the full gettext-official-v1 profile on the documented reference host.
Reference Host Rules
- use one documented benchmark machine for official comparisons
- keep Rust, Node, Python, and GNU gettext versions fixed across report runs
- minimize background load and network activity during a run
- keep the machine on AC power
- compare reports only within the same host and toolchain class
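The last rule can be checked mechanically before two reports are compared. The sketch below assumes each report exposes its environment as a flat mapping that uses the field names from the Reporting Expectations section; the real JSON layout may differ, and environment_mismatches is a hypothetical helper.

```python
# Field names taken from the Reporting Expectations section of this document.
COMPARABLE_FIELDS = (
    "system_label",
    "os",
    "cpu_model",
    "memory_bytes",
    "rustc_version",
    "node_version",
    "python_version",
)


def environment_mismatches(env_a, env_b, fields=COMPARABLE_FIELDS):
    """Return {field: (value_a, value_b)} for every field that differs."""
    return {
        field: (env_a.get(field), env_b.get(field))
        for field in fields
        if env_a.get(field) != env_b.get(field)
    }
```

An empty result means the two reports come from the same host and toolchain class as far as these fields can tell; any entry is a reason not to put the numbers side by side.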
Required Tooling
- Rust toolchain able to run cargo run -p ferrocat-bench
- Node.js plus the packages declared in benchmark/node/package.json
- Python 3 plus the packages declared in benchmark/python/requirements.txt
- PHP 8.1+ and Composer for the gettext/gettext adapter under benchmark/php
- GNU gettext commands msgcat and msgmerge
Suggested setup:
./benchmark/setup.sh
If benchmark/python/.venv exists, ferrocat-bench will automatically prefer that interpreter for verify-benchmark-env and compare, so polib does not need to be installed globally.
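The interpreter preference can be pictured as a simple lookup order. ferrocat-bench itself is written in Rust, so the Python below only illustrates the idea; the interpreter locations inside the virtual environment and the python3 fallback name are assumptions, and pick_python is a hypothetical helper.

```python
from pathlib import Path


def pick_python(repo_root):
    """Prefer the benchmark virtualenv interpreter, else fall back to PATH."""
    venv = Path(repo_root) / "benchmark" / "python" / ".venv"
    # Assumed layouts: POSIX venvs use bin/python, Windows venvs use Scripts/python.exe.
    candidates = [venv / "bin" / "python", venv / "Scripts" / "python.exe"]
    for candidate in candidates:
        if candidate.is_file():
            return str(candidate)
    return "python3"  # assumed fallback: whatever interpreter is on PATH
```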
If you only want the Python side, run:
./benchmark/python/setup.sh
Verify The Environment
Run:
cargo run -p ferrocat-bench -- verify-benchmark-env
This checks the required executables and adapter packages and prints the detected tool versions that will be captured in the report metadata.
Benchmark Profiles
- rust-scheduled-v1
  - Rust-only scheduled/reporting profile used by the Benchmark Reports workflow
  - covers owned/borrowed PO parse, PO stringify, merge, update, catalog storage parse, and ICU parse scenarios
  - avoids Node, Python, and GNU gettext adapter setup so the scheduled run stays about Ferrocat internals
- gettext-official-v1
  - the smallest official benchmark profile
  - intentionally benchmark-focused rather than test-focused
  - one conservative primary locale: de
  - one second normal locale: fr
  - one more complex plural locale: pl
  - one representative large corpus size per scenario
- gettext-official-quick-v1
  - the fast companion to gettext-official-v1
  - keeps the same fixture and external-tool matrix
  - lowers the minimum sample duration
  - uses fewer warmup and measured runs
  - useful for local iteration and regression checks, but not the publication-grade profile
- gettext-compat-v1
  - extended external benchmark suite
  - broader gettext-only matrix with additional locale/family coverage
  - useful when you want more detail than the slim official profile
- gettext-workflows-ecosystem-v1
  - extended workflow suite for classic gettext merge paths
  - compares merge_catalog against msgmerge, pofile, pofile-ts, and polib
  - every tool runs the same file-to-file pipeline: parse existing.po, parse template.pot, merge, and serialize. Ferrocat reads the same .pot instead of pre-structured messages, so the update comparison stays strictly apples-to-apples
  - the validation reparse used for the equivalence digest runs once outside the timed loop for every implementation, so no tool is charged for it
  - useful when you want workflow numbers across the broader gettext ecosystem
- serious-v1
  - advanced/internal benchmark suite
  - mixed and ICU-heavy workloads
  - useful for ferrocat's broader performance direction, but not the official cross-tool gettext baseline
- catalog-update-v1
  - catalog update/merge comparison against the real, maintained update tools
  - compares ferrocat merge against GNU msgmerge and Python babel's Catalog.update on de/fr
  - babel runs its real Catalog.update with no_fuzzy_matching=True to stay aligned with ferrocat's exact-identity merge model
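The gettext-workflows-ecosystem-v1 notes mention an equivalence digest computed from a validation reparse. One way to picture such a digest is to hash a normalized, order-independent view of the catalog entries, so two tools that serialize differently but agree on content produce the same value. Which fields the real digest covers is not stated here; the entry shape and the field choice below are assumptions.

```python
import hashlib
import json


def equivalence_digest(entries):
    """Hash catalog content independently of entry order and output formatting.

    `entries` is a list of dicts with "msgid" and optional "msgctxt",
    "msgstr" (list of strings), and "obsolete" keys (an assumed shape).
    """
    normalized = sorted(
        (
            entry.get("msgctxt") or "",
            entry["msgid"],
            tuple(entry.get("msgstr", [])),
            bool(entry.get("obsolete", False)),
        )
        for entry in entries
    )
    payload = json.dumps(normalized, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the digest only needs a parsed result, it can be computed once after the timed loop, which is what keeps the validation cost out of every tool's measurement.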
Run The Official Gettext Suite
Use the checked-in gettext-official-v1 profile and write the report outside the internal performance history:
cargo run --release -p ferrocat-bench -- compare gettext-official-v1 --out benchmark/results/gettext-official-v1-$(date +%Y%m%d-%H%M%S).json
The compare command:
- validates semantic equivalence for each comparison group before timing
- calibrates iterations per scenario to a minimum sample duration
- runs 2 warmups per scenario
- records 10 measured samples per parse/stringify scenario
- stores raw samples plus aggregated statistics in JSON
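The calibrate, warm up, then measure sequence can be sketched as follows. This is a minimal illustration under assumptions, not the ferrocat-bench measurement loop: the doubling strategy, the function names, and the per-iteration normalization are guesses, and the minimum sample duration is passed in explicitly because the official profile's value is not stated here.

```python
import time


def calibrate_iterations(work, minimum_sample_millis):
    """Double the iteration count until one sample lasts long enough."""
    iterations = 1
    while True:
        start = time.perf_counter_ns()
        for _ in range(iterations):
            work()
        elapsed_ns = time.perf_counter_ns() - start
        if elapsed_ns >= minimum_sample_millis * 1_000_000:
            return iterations
        iterations *= 2  # assumed growth strategy


def measure(work, minimum_sample_millis, warmups=2, samples=10):
    """Return (iterations, per-iteration ns for each measured sample)."""
    iterations = calibrate_iterations(work, minimum_sample_millis)

    def one_sample():
        start = time.perf_counter_ns()
        for _ in range(iterations):
            work()
        return (time.perf_counter_ns() - start) / iterations

    for _ in range(warmups):
        one_sample()  # warmup results are discarded
    return iterations, [one_sample() for _ in range(samples)]
```

Calibrating to a duration rather than a fixed iteration count keeps fast and slow scenarios on comparable timer resolution, which is why the quick profile can trade accuracy for speed simply by lowering that duration.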
For a quicker checkpoint with the same comparison matrix:
cargo run --release -p ferrocat-bench -- compare gettext-official-quick-v1 --out benchmark/results/gettext-official-quick-v1-$(date +%Y%m%d-%H%M%S).json
That profile currently uses:
- minimum_sample_millis: 100
- 1 warmup and 3 measured samples for parse/stringify scenarios
- 1 warmup and 2 measured samples for workflow scenarios
Use it for faster day-to-day checks. Keep gettext-official-v1 as the primary report for published comparisons.
For GNU gettext CLI scenarios, the report also records an empty-cli-run baseline using a minimal header-only input. This adds:
- baseline_elapsed_ns and adjusted sample fields for msgcat / msgmerge
- adjusted median statistics alongside the raw end-to-end statistics
The raw timing remains the primary comparison number. The adjusted timing is a secondary estimate for understanding how much of the CLI measurement is fixed overhead and how much is actual fixture work. On the 10k-message corpora that fixed overhead is about 2% of the measured time, so the GNU CLI gap reflects real merge and serialization work rather than process startup or file I/O. This is why the published msgmerge number is not treated as a launch-cost artifact.
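The relationship between raw, baseline, and adjusted timings is simple subtraction. The sketch below assumes the adjusted sample is the raw sample minus baseline_elapsed_ns, clamped at zero; the clamping and the function names are assumptions, and the numbers in the usage note are made up for illustration.

```python
from statistics import median


def adjust_samples(raw_samples_ns, baseline_elapsed_ns):
    """Subtract the empty-CLI-run cost from each raw end-to-end sample."""
    return [max(sample - baseline_elapsed_ns, 0) for sample in raw_samples_ns]


def fixed_overhead_share(raw_samples_ns, baseline_elapsed_ns):
    """Fraction of the raw median that is fixed process and I/O overhead."""
    return baseline_elapsed_ns / median(raw_samples_ns)
```

For example, with illustrative raw samples of 1000, 1020, and 980 ns and a 20 ns baseline, the overhead share is 0.02, which is the kind of ratio that justifies treating the raw number as real work.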
For the workflow ecosystem suite (the catalog-update comparison):
cargo run --release -p ferrocat-bench -- compare gettext-workflows-ecosystem-v1 --out benchmark/results/gettext-workflows-ecosystem-v1-$(date +%Y%m%d-%H%M%S).json
That profile compares merge_catalog against:
- msgmerge
- pofile
- pofile-ts
- polib
Every tool runs the same file-to-file pipeline (parse existing.po, parse template.pot, merge, serialize), so the catalog-update numbers stay strictly apples-to-apples. The msgmerge path runs with --no-fuzzy-matching, which keeps the comparison close to ferrocat's exact-match merge model instead of pitting it against heuristic fuzzy recovery.
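To make the exact-match model concrete, here is a small sketch of a merge without fuzzy recovery. It is not the merge_catalog implementation: the entry shape, the (msgctxt, msgid) identity key, and the handling of entries missing from the template (kept and marked obsolete) are all assumptions made for illustration.

```python
def merge_exact(existing, template):
    """Merge translations into a template by exact (msgctxt, msgid) identity.

    Both arguments are lists of dicts with "msgid" and optional "msgctxt";
    existing entries also carry "msgstr". No near-match lookup is attempted.
    """
    by_key = {(e.get("msgctxt"), e["msgid"]): e for e in existing}
    merged = []
    for entry in template:
        key = (entry.get("msgctxt"), entry["msgid"])
        previous = by_key.pop(key, None)
        # A template entry without an exact match starts untranslated.
        merged.append({**entry, "msgstr": previous["msgstr"] if previous else ""})
    # Assumed behavior: entries no longer in the template are kept as obsolete.
    obsolete = [{**e, "obsolete": True} for e in by_key.values()]
    return merged + obsolete
```

With fuzzy matching enabled, a changed msgid such as "Hello!" could inherit the old translation of "Hello" as a fuzzy candidate; in this model it simply starts empty, which is the behavior --no-fuzzy-matching selects for msgmerge.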
For the broader compatibility/detail suite:
cargo run --release -p ferrocat-bench -- compare gettext-compat-v1 --out benchmark/results/gettext-compat-v1-$(date +%Y%m%d-%H%M%S).json
Use this when you want more fixture variety than the slim official profile provides.
Result Storage
- External comparison reports should be written under benchmark/results/
- Scheduled Rust-only workflow reports stay attached to the Benchmark Reports workflow run unless they are promoted into a release or publication checkpoint
Current gettext-official-v1 Shape
- gettext-ui-de-10000
- gettext-saas-fr-10000
- gettext-commerce-pl-10000
External baselines currently wired:
- polib, pofile, pofile-ts, and gettext-parser on the classic gettext parse/stringify corpora: gettext-ui-de-10000, gettext-saas-fr-10000, gettext-commerce-pl-10000
- msgcat on stringify comparisons
- msgmerge on the conservative merge corpus, with --no-fuzzy-matching
- ferrocat internal owned vs borrowed parse baselines on de, fr, and pl
Workflow-only baselines currently wired:
- pofile, pofile-ts, and polib on gettext-workflows-ecosystem-v1
- each measured as parse -> merge -> serialize pipelines on gettext-ui-de-1000 and gettext-ui-de-10000
- gettext-parser is intentionally excluded from workflow benchmarking for now because its PO compile/parse path does not preserve obsolete entries in a way that is semantically fair for msgmerge-style workflows
- update_catalog is intentionally excluded from the public cross-tool benchmark tables because it is a broader catalog-maintenance API without a clean direct equivalent in the external comparison set
This is intentional. The official profile answers the small, understandable benchmark question first. The broader gettext-compat-v1 profile is available when you want more detail, and the advanced mixed-* / ICU-heavy corpora remain separate from the official Gettext comparison track.
Reporting Expectations
When you share benchmark results from the external suite, include the environment block from the JSON report together with the throughput table. At minimum, keep these fields visible:
- system_label
- os
- cpu_model
- memory_bytes
- rustc_version
- node_version
- python_version
- msgcat_version / msgmerge_version when GNU gettext numbers are shown
This keeps published numbers tied to the machine and toolchain they were measured on without exposing private hostnames.
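Before publishing, a short check can confirm that none of these fields got dropped from the shared environment block. The sketch assumes the block is a flat mapping keyed by the field names listed above, and it reads the last item as requiring both gettext version fields whenever GNU gettext numbers are shown; both points are assumptions, and missing_environment_fields is a hypothetical helper.

```python
ALWAYS_REQUIRED = (
    "system_label",
    "os",
    "cpu_model",
    "memory_bytes",
    "rustc_version",
    "node_version",
    "python_version",
)
GETTEXT_FIELDS = ("msgcat_version", "msgmerge_version")


def missing_environment_fields(environment, shows_gnu_gettext):
    """List required fields that are absent or empty in the environment block."""
    required = ALWAYS_REQUIRED + (GETTEXT_FIELDS if shows_gnu_gettext else ())
    return [field for field in required if environment.get(field) in (None, "")]
```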