Benchmark Interpretation Guide

Benchmarks in this project are reproducible comparison templates. They are designed to make model behavior visible under shared settings, not to establish general quantum advantage.

Use the benchmark pages to answer narrow questions:

  • Which included model works best for this small dataset and configuration?
  • Does a QML model improve over the strongest included classical reference?
  • How much quality is traded for runtime or finite-shot execution?
  • Are results stable across seeds, datasets, and model settings?

Required Context

Interpret every benchmark with these fields visible:

  • dataset name and sample count
  • model list
  • seed list
  • train and test metrics
  • confidence intervals or multi-seed summaries
  • generalization gaps
  • runtime summaries
  • circuit metadata when available, especially trainable-parameter count and estimated template depth
  • paired deltas against the best included classical baseline
  • shot count or analytic execution mode
  • tuning metadata for classical references

Single-seed benchmark outputs are smoke checks. Treat them as examples of the reporting format, not as evidence that one model is generally better.

Primary Metrics

Classification benchmarks rank models by mean test_accuracy, with higher values better.

Regression benchmarks rank models by mean test_mse, with lower values better. Secondary metrics such as MAE are useful for checking whether conclusions are consistent across loss functions.

The best_model field is a convenience summary. It should not be the only result used when deciding whether a model is useful.
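
As a minimal sketch of the ranking rule above, the snippet below sorts models by their mean per-seed test metric and keeps the full ordering instead of a single winner. The function name rank_models, the model names, and the metric values are all made up for illustration; they are not the project's API or real results.

```python
from statistics import mean


def rank_models(per_seed_metric, higher_is_better):
    """Return (model, mean_metric) pairs sorted best-first.

    per_seed_metric maps a model name to its list of per-seed test metrics.
    This is a hypothetical helper, not a function from the benchmark package.
    """
    means = {model: mean(values) for model, values in per_seed_metric.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=higher_is_better)


# Illustrative numbers only; they are not measured results.
accuracy_by_model = {
    "classical_svm": [0.90, 0.85, 0.95],
    "qml_kernel": [0.80, 0.90, 0.85],
}
mse_by_model = {
    "classical_ridge": [0.20, 0.30],
    "qml_regressor": [0.10, 0.20],
}

# Classification: higher mean test_accuracy ranks first.
accuracy_ranking = rank_models(accuracy_by_model, higher_is_better=True)
# Regression: lower mean test_mse ranks first.
mse_ranking = rank_models(mse_by_model, higher_is_better=False)
```

Reading the whole ranking, with the gap between adjacent entries, is usually more informative than the first entry alone.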

Confidence Intervals

The benchmark summaries report 95 percent confidence intervals when multiple seed results are available. Wide intervals mean the run is not precise enough to support a strong comparison.

Use intervals this way:

  • overlapping intervals mean the intervals alone do not separate the models, so treat the comparison as inconclusive
  • narrow intervals make seed-to-seed behavior easier to compare
  • single-run intervals are placeholders around one observed value
  • run more seeds before making claims in release notes or documentation

For small notebook defaults, the interval is mostly a reminder to rerun with more seeds before drawing stronger conclusions.
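
One way to see why few seeds give weak intervals is to compute one by hand. The sketch below uses a normal approximation across seeds; this is an assumption made for illustration, and the benchmark summaries may use a different method (for example a t-based or bootstrap interval). The function name mean_with_interval and the choice to collapse a single-run interval onto the observed value are also hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_with_interval(values, confidence=0.95):
    """Return (mean, lower, upper) for per-seed metrics.

    Uses a normal approximation, which is too narrow for very small seed
    counts; a t-based interval would be wider. Hypothetical helper.
    """
    center = mean(values)
    if len(values) < 2:
        # A single seed has no spread estimate, so the interval is only a
        # placeholder around the one observed value.
        return center, center, center
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(values) / sqrt(len(values))
    return center, center - half_width, center + half_width


# Illustrative values only, not measured results.
single_run = mean_with_interval([0.8])
three_seeds = mean_with_interval([0.80, 0.90, 0.85])
```

Because the half-width shrinks with the square root of the seed count, roughly four times as many seeds are needed to halve the interval at the same seed-to-seed spread.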

Paired Classical Deltas

paired_vs_best_classical compares each model against the best included classical baseline on matching seeds. This is the most useful summary for QML comparisons because matching seeds keep dataset splits aligned, so each delta reflects the model difference rather than split-to-split variation.

For classification, positive deltas mean better test accuracy than the classical reference.

For regression, negative deltas mean lower test error than the classical reference.

Read the paired table together with win/loss/tie counts. A small average delta with mixed wins and losses should be treated as inconclusive.
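
The sign conventions and win/loss/tie counts above can be sketched as follows. This is an illustrative helper under stated assumptions, not the code behind paired_vs_best_classical: the function name paired_summary, the seed-to-metric dictionaries, the tie_tolerance parameter, and the example values are all hypothetical.

```python
from statistics import mean


def paired_summary(model_scores, reference_scores, higher_is_better, tie_tolerance=0.0):
    """Summarize per-seed deltas (model minus reference) on matching seeds.

    Both inputs map seed -> test metric. Only seeds present in both are paired.
    """
    shared_seeds = sorted(set(model_scores) & set(reference_scores))
    if not shared_seeds:
        raise ValueError("no matching seeds to pair")
    deltas = [model_scores[s] - reference_scores[s] for s in shared_seeds]
    wins = losses = ties = 0
    for delta in deltas:
        if abs(delta) <= tie_tolerance:
            ties += 1
        elif (delta > 0) == higher_is_better:
            # Positive delta wins for accuracy; negative delta wins for error.
            wins += 1
        else:
            losses += 1
    return {"mean_delta": mean(deltas), "wins": wins, "losses": losses, "ties": ties}


# Illustrative per-seed test accuracies, not measured results.
qml = {0: 0.90, 1: 0.80, 2: 0.85}
classical = {0: 0.85, 1: 0.85, 2: 0.85}
summary = paired_summary(qml, classical, higher_is_better=True)
```

Here the mean delta is close to zero with one win, one loss, and one tie, which is the mixed pattern that should be reported as inconclusive.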

Generalization Gaps

Generalization gaps indicate how differently a model behaves on train and test splits.

For classification:

train_accuracy - test_accuracy

For regression:

test_mse - train_mse

Large positive gaps can indicate overfitting. Negative gaps can happen on small or noisy splits and should be checked across more seeds.
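
The two gap definitions above are oriented so that a positive value always means the model did better on the training split. A minimal sketch, with hypothetical function names and made-up inputs:

```python
def classification_gap(train_accuracy, test_accuracy):
    """Gap for accuracy-like metrics: train_accuracy - test_accuracy."""
    return train_accuracy - test_accuracy


def regression_gap(train_mse, test_mse):
    """Gap for error-like metrics: test_mse - train_mse."""
    return test_mse - train_mse


# Illustrative values only. Both gaps are positive, meaning the model
# performed better on the training split than on the test split.
accuracy_gap = classification_gap(train_accuracy=1.0, test_accuracy=0.75)
mse_gap = regression_gap(train_mse=0.5, test_mse=0.75)
```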

Runtime Tradeoffs

QML models often have higher runtime at these small sample sizes because the notebooks favor transparent circuit evaluation over optimized deployment.

Interpret runtime together with the primary metric:

  • a slower model with no paired metric gain is not a useful default
  • a slower model with consistent paired gains may justify deeper investigation
  • analytic simulators and finite-shot runs should not be compared as if they had the same execution cost
  • runtime scaling benchmarks are needed before making claims about larger data

Circuit metadata adds a second complexity signal. Parameter counts and estimated depths are package-template summaries, not hardware-compiled resource estimates. They still help distinguish a small fixed-feature circuit from a deeper trainable workflow when runtimes or metrics are similar.

The smoke-scale notebooks are intentionally small enough for documentation generation. They are not tuned for maximum model quality.

Finite-Shot Results

Finite-shot benchmarks compare analytic execution with sampled execution at fixed shot counts. Expect additional variance as shots decrease.

Useful signals include:

  • metric degradation from analytic to low-shot settings
  • whether higher shot counts recover analytic behavior
  • whether rankings change under finite-shot execution
  • whether runtime increases enough to change the practical recommendation

Report finite-shot results as robustness checks, not as hardware performance claims.
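
The extra variance at low shot counts follows from basic sampling statistics. As a sketch, assume the model reads out a single measurement whose outcomes are +1 or -1 (for example a Pauli observable); this is an assumption for illustration and may not match every model in the suite. The sampled mean then has standard error sqrt((1 - E^2) / shots), where E is the analytic expectation value. The function name shot_standard_error is hypothetical.

```python
from math import sqrt


def shot_standard_error(expectation, shots):
    """Standard error of the sampled mean of a +1/-1 valued measurement.

    expectation is the analytic expectation value in [-1, 1].
    """
    if shots <= 0:
        raise ValueError("shots must be positive")
    if not -1.0 <= expectation <= 1.0:
        raise ValueError("expectation must lie in [-1, 1]")
    return sqrt((1.0 - expectation ** 2) / shots)


# Quadrupling the shot count halves the sampling error.
error_at_100 = shot_standard_error(0.0, 100)
error_at_400 = shot_standard_error(0.0, 400)
```

The square-root scaling is why moving from low to moderate shot counts often recovers most of the analytic behavior, while the last part of the gap is expensive to close.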

Tuning and Fairness

Classical baselines may be tuned with small grid-search defaults. Quantum model settings are supplied explicitly through benchmark arguments and model_kwargs.

A comparison is more meaningful when:

  • all models use the same train/test splits
  • classical baselines include at least one strong nonlinear reference
  • model settings are stated
  • seed counts are high enough for stable summaries
  • runtime is reported with the primary metric

If only the default notebook settings were used, describe the result as a smoke-scale comparison.
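
The first condition above, shared train/test splits, is easiest to guarantee when the split depends only on the sample count and the seed, never on the model. The sketch below shows one way to do that with the standard library. The function name split_indices and its parameters are hypothetical, and the project may build its splits differently.

```python
import random


def split_indices(sample_count, test_fraction, seed):
    """Return (train_indices, test_indices) determined only by the seed.

    Hypothetical helper: every model evaluated with the same seed receives
    exactly the same split.
    """
    indices = list(range(sample_count))
    # A private Random instance avoids touching global random state.
    random.Random(seed).shuffle(indices)
    test_size = max(1, round(sample_count * test_fraction))
    return sorted(indices[test_size:]), sorted(indices[:test_size])


# Two models evaluated on seed 0 see identical train and test indices.
train_a, test_a = split_indices(sample_count=20, test_fraction=0.25, seed=0)
train_b, test_b = split_indices(sample_count=20, test_fraction=0.25, seed=0)
```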

Release Language

Use careful wording in release notes and docs:

  • "matched the best included classical baseline on this smoke-scale run"
  • "improved the paired test metric in this benchmark configuration"
  • "degraded under low-shot execution"
  • "requires more seeds or larger sweeps before stronger claims"

Avoid wording such as:

  • "quantum advantage"
  • "outperforms classical machine learning" without scope
  • "state of the art"
  • "proves superiority"

The benchmark suite is most useful as a reproducible harness for inspecting model behavior under controlled settings.