External Benchmarking

Methodology and operating rules for serious external benchmark comparisons.

ferrocat-bench exposes a manual, reproducible comparison suite for external baselines.

The internal microbenchmarks remain the fast day-to-day performance loop. The external comparison suite is for serious cross-runtime checkpoints on a documented reference host, especially when comparing Ferrocat against GNU gettext, Node, and Python ecosystem tools.

Choosing Comparison Targets

Comparison targets are kept to libraries that are both widely used and actively maintained -- the tools real projects actually depend on. Niche or single-author experiments are deliberately left out: a fast number against an unmaintained parser is not a meaningful baseline.

Because Ferrocat ships into JS/TS frontends through Palamedes, the most relevant ICU MessageFormat baselines are the established JavaScript parsers (@formatjs/icu-messageformat-parser, @messageformat/parser), not Rust-native ICU crates. The Rust ecosystem has no widely adopted MessageFormat-1 parser, and ICU4X targets MessageFormat 2.0 only, so there is no fair same-language ICU peer.

The same holds for PO itself: there is no widely adopted Rust PO library to compare against. The one Rust PO crate (polib) has effectively no ecosystem adoption (single author, ~14 stars, no reverse dependencies on crates.io), so it is left out for the same reason as niche ICU parsers. PO comparisons therefore run against the established JS, Python, and PHP tools (pofile, gettext-parser, Python polib, and PHP gettext/gettext). For catalog updates, gettext/gettext's mergeWith, Babel's Catalog.update, and GNU msgmerge are the maintained tools with real merge semantics.

ICU compile-to-runtime (@messageformat/core, @formatjs/cli, Lingui) is a separate axis and is intentionally not wired up yet: those tools precompile ICU to ASTs/functions, whereas Ferrocat's compile_catalog_artifact resolves catalogs to runtime message strings and leaves ICU parsing to the client. A direct compile-vs-compile benchmark would compare different operations.

Comparison Targets At A Glance

The table below lists every external tool the suite measures against, together with the pinned version captured in report metadata. pofile-ts is our own performance-focused fork of pofile, so the fastest JavaScript PO parser in the suite is one we maintain ourselves: the comparison deliberately uses the harder JS target rather than the slower original.

Tool	Runtime	Role in the suite	Pinned version
pofile-ts (our fork)	Node	PO parse, serialize, merge	4.0.3
pofile	Node	PO parse, serialize, merge	1.1.4
gettext-parser	Node	PO parse, serialize	9.0.2
polib	Python	PO parse, serialize, merge	1.2.0
Babel	Python	catalog update	2.18.0
gettext/gettext	PHP	PO parse, serialize	5.7
GNU gettext (`msgmerge`, `msgcat`)	C	merge, concat	system
@formatjs/icu-messageformat-parser	Node	ICU parse	3.5.11
@messageformat/parser	Node	ICU parse	5.1.1

Scheduled Rust Reports

The Benchmark Reports GitHub Actions workflow runs every Monday and can also be started manually. It uses the Rust-only rust-scheduled-v1 profile by default, writes a JSON report under target/benchmark/, and uploads that report as a workflow artifact.

Scheduled hosted-runner reports are meant for trend visibility and noise detection; they are not publication-grade benchmarks. Pull request runs compare the PR report against a fresh baseline report from the PR base SHA and fail only when a matching, non-noisy scenario regresses by more than 20% in median elapsed time. Scenarios marked noisy by coefficient of variation or relative span are reported in the summary but do not fail the workflow.

The comparison command is also available locally:

cargo run --release -p ferrocat-bench -- regression-check \
  --baseline target/benchmark/rust-scheduled-v1-baseline.json \
  --current target/benchmark/rust-scheduled-v1-current.json \
  --max-regression-percent 20

Regression Budgets

The rust-scheduled-v1 profile is the PR-visible regression guard. Its budget is intentionally relative to a same-run baseline rather than to a fixed hosted-runner throughput number:

fail matching, non-noisy scenarios above 20% median elapsed regression
skip noisy scenarios and surface them in the job summary
treat scheduled reports as trend artifacts unless a maintainer promotes one into release notes or a reference-host checkpoint
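
As a rough illustration of the first two rules above, the sketch below applies them to two small in-memory reports. The dictionary layout (`median_ns`, `noisy`), the scenario names, and the function name are assumptions made for this example; they are not the actual ferrocat-bench report schema, and the real check remains the regression-check subcommand.

```python
# Illustrative sketch only: the report layout used here (scenario id ->
# {"median_ns", "noisy"}) is a hypothetical simplification, not the real
# ferrocat-bench JSON schema.

def check_regressions(baseline, current, max_regression_percent=20.0):
    """Return (failures, skipped) for scenarios present in both reports."""
    failures = []
    skipped = []
    # Only scenarios that exist in both reports are compared.
    for scenario_id in sorted(baseline.keys() & current.keys()):
        base = baseline[scenario_id]
        cur = current[scenario_id]
        # Noisy scenarios are surfaced but never fail the check.
        if base["noisy"] or cur["noisy"]:
            skipped.append(scenario_id)
            continue
        # Positive values mean the current run is slower than the baseline.
        change_percent = (cur["median_ns"] - base["median_ns"]) / base["median_ns"] * 100.0
        if change_percent > max_regression_percent:
            failures.append((scenario_id, round(change_percent, 1)))
    return failures, skipped


# Hypothetical scenario names and timings.
baseline = {
    "po-parse/de": {"median_ns": 1_000_000, "noisy": False},
    "po-merge/de": {"median_ns": 2_000_000, "noisy": False},
    "icu-parse/pl": {"median_ns": 500_000, "noisy": True},
}
current = {
    "po-parse/de": {"median_ns": 1_150_000, "noisy": False},  # +15%: within budget
    "po-merge/de": {"median_ns": 2_500_000, "noisy": False},  # +25%: over budget
    "icu-parse/pl": {"median_ns": 900_000, "noisy": False},   # skipped: noisy baseline
}

failures, skipped = check_regressions(baseline, current)
print(failures)  # [('po-merge/de', 25.0)]
print(skipped)   # ['icu-parse/pl']
```

Because both reports come from the same run, the percentage compares like with like even when the hosted runner itself is slow or fast on a given day.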

The slowest Rust-only path currently under review is po-update/catalog-icu-heavy/ferrocat. It has no separate allowance: its budget is the same 20% PR regression band that applies to every other scenario. If it trips the guard, or if a same-host local run drops below roughly 40 MiB/s, profile the path before changing the threshold. The first profiling pass should separate the costs of repeated ICU parsing/projection, plural category resolution, allocation/cloning, serialization, and catalog normalization.
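
To relate a measured median to the roughly 40 MiB/s floor, the arithmetic can be sketched as follows. This assumes throughput is defined as input bytes divided by median elapsed time, which is an assumption about how the report derives the figure; the function names and sample numbers are hypothetical.

```python
MIB = 1024 * 1024


def throughput_mib_per_s(input_bytes, median_elapsed_ns):
    """Convert a fixture size and a median elapsed time into MiB/s."""
    seconds = median_elapsed_ns / 1_000_000_000
    return input_bytes / MIB / seconds


def needs_profiling(input_bytes, median_elapsed_ns, floor_mib_per_s=40.0):
    """True when the run falls below the throughput floor named above."""
    return throughput_mib_per_s(input_bytes, median_elapsed_ns) < floor_mib_per_s


# Hypothetical run: an 8 MiB fixture processed in a 250 ms median.
print(throughput_mib_per_s(8 * MIB, 250_000_000))  # 32.0
print(needs_profiling(8 * MIB, 250_000_000))       # True

# The same fixture in 125 ms is comfortably above the floor.
print(throughput_mib_per_s(8 * MIB, 125_000_000))  # 64.0
print(needs_profiling(8 * MIB, 125_000_000))       # False
```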

Reproduce A Fast Baseline

For a quick local signal before and after hot-path changes, use the quick official profile:

cargo run --release -p ferrocat-bench -- compare gettext-official-quick-v1 --out target/benchmark/gettext-official-quick-v1.json

Keep the generated JSON report with your notes when a change is performance-motivated. For numbers you intend to publish or compare publicly, use the full gettext-official-v1 profile on the documented reference host.

Reference Host Rules

use one documented benchmark machine for official comparisons
keep Rust, Node, Python, and GNU gettext versions fixed across report runs
minimize background load and network activity during a run
keep the machine on AC power
compare reports only within the same host and toolchain class

Required Tooling

Rust toolchain able to run cargo run -p ferrocat-bench
Node.js plus the packages declared in benchmark/node/package.json
Python 3 plus the packages declared in benchmark/python/requirements.txt
PHP 8.1+ and Composer for the gettext/gettext adapter under benchmark/php
GNU gettext commands msgcat and msgmerge

Suggested setup:

./benchmark/setup.sh

If benchmark/python/.venv exists, ferrocat-bench will automatically prefer that interpreter for verify-benchmark-env and compare, so polib does not need to be installed globally.

If you only want the Python side, run:

./benchmark/python/setup.sh

Verify The Environment

Run:

cargo run -p ferrocat-bench -- verify-benchmark-env

This checks the required executables and adapter packages and prints the detected tool versions that will be captured in the report metadata.
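
For a quick standalone pre-check before running the real command, something like the sketch below can confirm that the executables are on PATH. It is not what verify-benchmark-env does internally: it does not check adapter packages or versions, and the binary names (`node`, `python3`, `php`, `composer`) are assumptions derived from the tooling list above.

```python
import shutil

# Binary names are assumptions based on the Required Tooling list; adjust
# them if your system uses different names (for example "python").
REQUIRED_EXECUTABLES = [
    "cargo",
    "node",
    "python3",
    "php",
    "composer",
    "msgcat",
    "msgmerge",
]


def find_missing(executables, which=shutil.which):
    """Return the executables that cannot be resolved on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name in executables if which(name) is None]


missing = find_missing(REQUIRED_EXECUTABLES)
if missing:
    print("missing executables:", ", ".join(missing))
else:
    print("all required executables found")
```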

Benchmark Profiles

rust-scheduled-v1
- Rust-only scheduled/reporting profile used by the Benchmark Reports workflow
- covers owned/borrowed PO parse, PO stringify, merge, update, catalog storage parse, and ICU parse scenarios
- avoids Node, Python, and GNU gettext adapter setup so the scheduled run stays focused on Ferrocat internals
gettext-official-v1
- the smallest official benchmark profile
- intentionally benchmark-focused rather than test-focused
- one conservative primary locale: de
- one second normal locale: fr
- one more complex plural locale: pl
- one representative large corpus size per scenario
gettext-official-quick-v1
- the fast companion to gettext-official-v1
- keeps the same fixture and external-tool matrix
- lowers the minimum sample duration
- uses fewer warmup and measured runs
- useful for local iteration and regression checks, but not the publication-grade profile
gettext-compat-v1
- extended external benchmark suite
- broader gettext-only matrix with additional locale/family coverage
- useful when you want more detail than the slim official profile
gettext-workflows-ecosystem-v1
- extended workflow suite for classic gettext merge paths
- compares merge_catalog against msgmerge, pofile, pofile-ts, and polib
- every tool runs the same file-to-file pipeline: parse existing.po, parse template.pot, merge, and serialize. Ferrocat reads the same .pot instead of pre-structured messages, so the update comparison stays strictly apples-to-apples
- the validation reparse used for the equivalence digest runs once outside the timed loop for every implementation, so no tool is charged for it
- useful when you want workflow numbers across the broader gettext ecosystem
serious-v1
- advanced/internal benchmark suite
- mixed and ICU-heavy workloads
- useful for Ferrocat's broader performance direction, but not the official cross-tool gettext baseline
catalog-update-v1
- catalog update/merge comparison against the real, maintained update tools
- compares Ferrocat's merge against GNU msgmerge and Babel's Catalog.update (Python) on de/fr
- Babel runs its real Catalog.update with no_fuzzy_matching=True to stay aligned with Ferrocat's exact-identity merge model

Run The Official Gettext Suite

Use the checked-in gettext-official-v1 profile and write the report outside the internal performance history:

cargo run --release -p ferrocat-bench -- compare gettext-official-v1 --out benchmark/results/gettext-official-v1-$(date +%Y%m%d-%H%M%S).json

The compare command:

validates semantic equivalence for each comparison group before timing
calibrates iterations per scenario to a minimum sample duration
runs 2 warmups per scenario
records 10 measured samples per parse/stringify scenario
stores raw samples plus aggregated statistics in JSON
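
The calibrate, warm up, then sample sequence listed above can be sketched in a few lines. This is a minimal illustration, not the ferrocat-bench harness: the doubling strategy, the function names, and the result keys are assumptions, and only the 2 warmups and 10 samples come from the list above.

```python
import statistics
import time


def calibrate_iterations(run_batch, minimum_sample_ns):
    """Double the iteration count until one batch lasts long enough.

    The doubling strategy is an assumption made for this sketch.
    """
    iterations = 1
    while run_batch(iterations) < minimum_sample_ns:
        iterations *= 2
    return iterations


def measure(run_batch, minimum_sample_ns, warmups=2, samples=10):
    """Calibrate, discard warmup batches, then record per-iteration samples."""
    iterations = calibrate_iterations(run_batch, minimum_sample_ns)
    for _ in range(warmups):
        run_batch(iterations)  # warmup result is intentionally discarded
    per_iteration_ns = [run_batch(iterations) / iterations for _ in range(samples)]
    return {
        "iterations": iterations,
        "samples_ns": per_iteration_ns,  # raw samples
        "median_ns": statistics.median(per_iteration_ns),  # aggregated statistic
    }


def timed_batch(workload):
    """Wrap a zero-argument workload as run_batch(iterations) -> elapsed ns."""
    def run_batch(iterations):
        start = time.perf_counter_ns()
        for _ in range(iterations):
            workload()
        return time.perf_counter_ns() - start
    return run_batch


# Toy workload with a 1 ms minimum sample duration, purely for demonstration.
report = measure(timed_batch(lambda: sum(range(1000))), minimum_sample_ns=1_000_000)
print(report["iterations"], report["median_ns"])
```

Calibrating to a minimum sample duration keeps very fast scenarios from being dominated by timer resolution, which is why the quick profile can trade accuracy for speed simply by lowering that duration and the sample counts.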

For a quicker checkpoint with the same comparison matrix:

cargo run --release -p ferrocat-bench -- compare gettext-official-quick-v1 --out benchmark/results/gettext-official-quick-v1-$(date +%Y%m%d-%H%M%S).json

That profile currently uses:

minimum_sample_millis: 100
1 warmup and 3 measured samples for parse/stringify scenarios
1 warmup and 2 measured samples for workflow scenarios

Use it for faster day-to-day checks. Keep gettext-official-v1 as the primary report for published comparisons.

For GNU gettext CLI scenarios, the report also records an empty-cli-run baseline using a minimal header-only input. This adds:

baseline_elapsed_ns and adjusted sample fields for msgcat / msgmerge
adjusted median statistics alongside the raw end-to-end statistics

The raw timing remains the primary comparison number. The adjusted timing is a secondary estimate for understanding how much of the CLI measurement is fixed overhead and how much is actual fixture work. On the 10k-message corpora that fixed overhead is about 2% of the measured time, so the GNU CLI gap reflects real merge and serialization work rather than process startup or file I/O. This is why the published msgmerge number is not treated as a launch-cost artifact.
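
A worked example makes the relationship between the raw and adjusted numbers concrete. The sketch assumes the adjusted value is simply the raw elapsed time minus the empty-cli-run baseline; that formula and the sample timings are assumptions chosen to reproduce the "about 2%" figure, not values taken from a real report.

```python
def adjust_for_cli_overhead(raw_elapsed_ns, baseline_elapsed_ns):
    """Return (adjusted_ns, overhead_fraction) for one CLI sample.

    Assumes adjusted = raw - baseline, clamped at zero.
    """
    adjusted_ns = max(raw_elapsed_ns - baseline_elapsed_ns, 0)
    overhead_fraction = baseline_elapsed_ns / raw_elapsed_ns
    return adjusted_ns, overhead_fraction


# Hypothetical timings for a msgmerge run on a 10k-message corpus.
raw = 150_000_000     # 150 ms end-to-end
baseline = 3_000_000  # 3 ms for the header-only empty-cli-run

adjusted, fraction = adjust_for_cli_overhead(raw, baseline)
print(adjusted)           # 147000000
print(f"{fraction:.1%}")  # 2.0%
```

With overhead this small, the raw and adjusted medians tell the same story, which is why the raw number stays the headline figure.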

For the workflow ecosystem suite (the catalog-update comparison):

cargo run --release -p ferrocat-bench -- compare gettext-workflows-ecosystem-v1 --out benchmark/results/gettext-workflows-ecosystem-v1-$(date +%Y%m%d-%H%M%S).json

That profile compares merge_catalog against:

msgmerge
pofile
pofile-ts
polib

Every tool runs the same file-to-file pipeline (parse existing.po, parse template.pot, merge, serialize), so the catalog-update numbers stay strictly apples-to-apples. The msgmerge path runs with --no-fuzzy-matching, which keeps the comparison close to Ferrocat's exact-match merge model instead of pitting it against heuristic fuzzy recovery.
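
To show what an exact-match merge means in practice, here is a minimal sketch on toy data. It is not Ferrocat's merge_catalog: the data shapes and the function name are hypothetical, and marking dropped translated entries as obsolete mirrors common msgmerge behaviour rather than anything this document states about Ferrocat.

```python
def merge_exact(existing, template):
    """Merge translations by exact (msgctxt, msgid) identity.

    existing: dict mapping (msgctxt, msgid) -> msgstr
    template: list of (msgctxt, msgid) keys in template order
    """
    merged = []
    template_keys = set(template)
    for key in template:
        # A translation is kept only when the key matches exactly;
        # there is no similarity search for near misses.
        merged.append({"key": key, "msgstr": existing.get(key, ""), "obsolete": False})
    for key, msgstr in existing.items():
        # Translated entries that left the template are kept as obsolete.
        if key not in template_keys and msgstr:
            merged.append({"key": key, "msgstr": msgstr, "obsolete": True})
    return merged


existing = {
    (None, "Save"): "Speichern",
    (None, "Delete file"): "Datei löschen",
}
template = [(None, "Save"), (None, "Delete files")]

for entry in merge_exact(existing, template):
    print(entry)
```

Here "Delete files" comes out untranslated even though "Delete file" is nearly identical. A fuzzy matcher would typically reuse the old translation and flag it for review, which is different work and would not be a fair timing comparison.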

For the broader compatibility/detail suite:

cargo run --release -p ferrocat-bench -- compare gettext-compat-v1 --out benchmark/results/gettext-compat-v1-$(date +%Y%m%d-%H%M%S).json

Use this when you want more fixture variety than the slim official profile provides.

Result Storage

External comparison reports should be written under benchmark/results/
Scheduled Rust-only workflow reports stay attached to the Benchmark Reports workflow run unless they are promoted into a release or publication checkpoint

Current `gettext-official-v1` Shape

The profile currently covers three large corpora, one per locale:

gettext-ui-de-10000
gettext-saas-fr-10000
gettext-commerce-pl-10000

External baselines currently wired:

polib, pofile, pofile-ts, and gettext-parser on the classic gettext parse/stringify corpora: gettext-ui-de-10000, gettext-saas-fr-10000, gettext-commerce-pl-10000
msgcat on stringify comparisons
msgmerge on the conservative merge corpus, with --no-fuzzy-matching
Ferrocat-internal owned vs borrowed parse baselines on de, fr, and pl

Workflow-only baselines currently wired:

pofile, pofile-ts, and polib on gettext-workflows-ecosystem-v1, each measured as a parse -> merge -> serialize pipeline on gettext-ui-de-1000 and gettext-ui-de-10000
gettext-parser is intentionally excluded from workflow benchmarking for now because its PO compile/parse path does not preserve obsolete entries in a way that is semantically fair for msgmerge-style workflows
update_catalog is intentionally excluded from the public cross-tool benchmark tables because it is a broader catalog-maintenance API without a clean direct equivalent in the external comparison set

This slim shape is intentional. The official profile answers the small, understandable benchmark question first. The broader gettext-compat-v1 profile is available when you want more detail, and the advanced mixed-* / ICU-heavy corpora remain separate from the official gettext comparison track.

Reporting Expectations

When you share benchmark results from the external suite, include the environment block from the JSON report together with the throughput table. At minimum, keep these fields visible:

system_label
os
cpu_model
memory_bytes
rustc_version
node_version
python_version
msgcat_version / msgmerge_version when GNU gettext numbers are shown

This keeps published numbers tied to the machine and toolchain they were measured on without exposing private hostnames.
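
One way to pull exactly these fields out of a report before publishing is sketched below. The field names come from the list above, but the top-level `environment` key, the `hostname` field, and all sample values are assumptions for illustration; check them against an actual report.

```python
import json

REQUIRED_FIELDS = [
    "system_label",
    "os",
    "cpu_model",
    "memory_bytes",
    "rustc_version",
    "node_version",
    "python_version",
]
GNU_FIELDS = ["msgcat_version", "msgmerge_version"]


def environment_summary(report, include_gnu=False):
    """Return only the publishable environment fields from a report dict."""
    env = report["environment"]  # hypothetical key name
    fields = REQUIRED_FIELDS + (GNU_FIELDS if include_gnu else [])
    missing = [name for name in fields if name not in env]
    if missing:
        raise KeyError("report environment is missing: " + ", ".join(missing))
    # Anything not listed (for example a private hostname) is dropped.
    return {name: env[name] for name in fields}


# Placeholder values only; none of these describe a real reference host.
sample_report = {
    "environment": {
        "system_label": "reference-host-a",
        "hostname": "private-host.example",
        "os": "linux",
        "cpu_model": "example-cpu",
        "memory_bytes": 34359738368,
        "rustc_version": "x.y.z",
        "node_version": "x.y.z",
        "python_version": "x.y.z",
    }
}

print(json.dumps(environment_summary(sample_report), indent=2))
```

Pass include_gnu=True when the shared table contains msgcat or msgmerge numbers, so the GNU gettext versions travel with them.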