Benchmark Utilities¶
The qml.benchmarks module compares quantum and classical models across
multiple random seeds under shared dataset settings. The helpers are intended
for reproducible model-quality comparisons, not claims of quantum advantage.
Benchmark outputs include:
- train/test metrics
- runtime totals
- fit/predict timing breakdowns when workflows expose them
- train/test generalization gaps
- circuit metadata for circuit-backed workflows when available
- mean, standard deviation, and 95% confidence intervals
- paired deltas against the best included classical baseline
- tuning metadata for tuned classical baselines
- environment metadata for reproducibility
Both classification and regression workflows are supported.
Classification Benchmarks¶
Supported models:
vqc
qcnn
quantum_kernel
trainable_quantum_kernel
quantum_metric_learning
quantum_reservoir
logistic_regression
svm_classifier
mlp_classifier
random_forest_classifier
gradient_boosting_classifier
knn_classifier
gaussian_process_classifier
Example:
from qml.benchmarks import compare_classification_models

result = compare_classification_models(
    models=[
        "vqc",
        "quantum_kernel",
        "svm_classifier",
        "random_forest_classifier",
        "logistic_regression",
    ],
    seeds=[0, 1, 2, 3],
    n_samples=200,
    noise=0.1,
)
Each classification run record includes:
{
    "model": "vqc",
    "seed": 0,
    "dataset": "moons",
    "train_accuracy": ...,
    "test_accuracy": ...,
    "generalization_gap": ...,
    "runtime_seconds": ...,
    "timing": {"total_seconds": ...},
    "final_loss": ...,
}
For classification, generalization_gap is:
train_accuracy - test_accuracy
Regression Benchmarks¶
Supported models:
vqr
quantum_kernel_regressor
trainable_quantum_kernel_regressor
quantum_gaussian_process_regressor
quantum_reservoir_regressor
ridge_regression
mlp_regressor
kernel_ridge_regression
svr_regression
gaussian_process_regressor
random_forest_regressor
gradient_boosting_regressor
knn_regressor
lasso_regression
elasticnet_regression
Example:
from qml.benchmarks import compare_regression_models

result = compare_regression_models(
    models=["vqr", "ridge_regression", "kernel_ridge_regression", "svr_regression"],
    seeds=[0, 1, 2],
    n_samples=200,
    noise=0.1,
)
For regression, generalization_gap is:
test_mse - train_mse
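Both definitions are oriented so that a positive gap means the model did worse on the test split than on the training split. A minimal sketch of the two formulas, using hypothetical helper names and illustrative numbers rather than benchmark output:

```python
def classification_gap(train_accuracy: float, test_accuracy: float) -> float:
    """Gap as documented for classification: train_accuracy - test_accuracy."""
    return train_accuracy - test_accuracy


def regression_gap(train_mse: float, test_mse: float) -> float:
    """Gap as documented for regression: test_mse - train_mse."""
    return test_mse - train_mse


# Illustrative values only; these helpers are not part of qml.benchmarks.
clf_gap = classification_gap(train_accuracy=0.75, test_accuracy=0.5)
reg_gap = regression_gap(train_mse=0.25, test_mse=0.5)

# In both cases a positive value indicates worse test than train performance.
print(clf_gap, reg_gap)
```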
Result Structure¶
Returned benchmark dictionaries include:
{
    "benchmark_type": "classification",
    "models": [...],
    "seeds": [...],
    "dataset": "moons",
    "runs": [...],
    "summary": {
        "svm_classifier": {
            "test_accuracy": {
                "mean": ...,
                "std": ...,
                "n": ...,
                "ci95_low": ...,
                "ci95_high": ...,
            },
            "runtime_seconds": {...},
            "generalization_gap": {...},
            "trainable_parameters": {...},
            "estimated_depth": {...},
            "n_runs": 4,
        }
    },
    "best_model": {
        "model": "svm_classifier",
        "metric": "test_accuracy",
        "value": ...,
        "higher_is_better": True,
    },
    "paired_vs_best_classical": {...},
    "metadata": {...},
}
The best_model field is selected from the aggregate test metric:
- classification: highest mean test_accuracy
- regression: lowest mean test_mse
Use it as a convenience summary only. Always inspect run records, confidence intervals, paired classical deltas, and runtime before drawing conclusions.
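To make the summary fields concrete, here is a rough sketch of how per-seed run records could be aggregated into the documented mean, std, n, ci95_low, and ci95_high keys, and how a best model could then be picked from the mean. The helper names, the use of the sample standard deviation, and the normal-approximation interval are all assumptions; the package may compute its intervals differently (for example with a Student-t critical value).

```python
from collections import defaultdict
from statistics import NormalDist, mean, stdev


def summarize(values):
    """Mean, sample std, n, and a normal-approximation 95% interval (assumed formula)."""
    n = len(values)
    center = mean(values)
    spread = stdev(values) if n > 1 else 0.0
    half_width = NormalDist().inv_cdf(0.975) * spread / n ** 0.5
    return {
        "mean": center,
        "std": spread,
        "n": n,
        "ci95_low": center - half_width,
        "ci95_high": center + half_width,
    }


def aggregate(runs, metric):
    """Group run records by model and summarize one metric across seeds."""
    by_model = defaultdict(list)
    for run in runs:
        by_model[run["model"]].append(run[metric])
    return {
        model: {metric: summarize(values), "n_runs": len(values)}
        for model, values in by_model.items()
    }


def pick_best(summary, metric, higher_is_better):
    """Select the model with the highest (or lowest) mean of the metric."""
    choose = max if higher_is_better else min
    model = choose(summary, key=lambda name: summary[name][metric]["mean"])
    return {
        "model": model,
        "metric": metric,
        "value": summary[model][metric]["mean"],
        "higher_is_better": higher_is_better,
    }


# Made-up run records for illustration; real records carry more fields.
runs = [
    {"model": "svm_classifier", "seed": 0, "test_accuracy": 0.90},
    {"model": "svm_classifier", "seed": 1, "test_accuracy": 0.92},
    {"model": "svm_classifier", "seed": 2, "test_accuracy": 0.94},
    {"model": "vqc", "seed": 0, "test_accuracy": 0.80},
    {"model": "vqc", "seed": 1, "test_accuracy": 0.85},
    {"model": "vqc", "seed": 2, "test_accuracy": 0.90},
]
summary = aggregate(runs, "test_accuracy")
best = pick_best(summary, "test_accuracy", higher_is_better=True)
print(best["model"], summary[best["model"]]["test_accuracy"])
```

With only three or four seeds the interval is wide, which is one reason the mean-based best_model should not be read in isolation.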
Runtime Scaling Benchmarks¶
Use benchmark_runtime_scaling(...) to sweep sample counts and shot counts while
keeping the normal classification or regression benchmark result structure:
from qml.benchmarks import benchmark_runtime_scaling

result = benchmark_runtime_scaling(
    task="classification",
    models=["quantum_reservoir", "logistic_regression"],
    sample_sizes=[50, 100, 150],
    shots_values=[None, 128],
    seeds=[0, 1, 2],
    dataset="moons",
    model_kwargs={"quantum_reservoir": {"n_layers": 1}},
)
The returned dictionary includes:
- configurations: the nested benchmark output for each sample/shot setting
- scaling_summary: one flat row per model and setting, with the primary-metric mean and confidence interval, the runtime mean and confidence interval, and circuit metadata summaries when available
For classification, the primary metric is test_accuracy. For regression, the
primary metric is test_mse.
The matching CLI preset is:
qml-pennylane benchmark runtime-scaling \
    --task classification \
    --models quantum_reservoir logistic_regression \
    --sample-sizes 50 100 150 \
    --shots analytic 128 \
    --seeds 0 1 2
At these settings a runtime sweep is only smoke-test scale. Increase the seeds, sample sizes, and model-specific training settings before reading trends from it, and treat the output as scaling diagnostics, not a full performance claim.
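A rough cost estimate helps when deciding how far to scale a sweep. The sketch below assumes that the sweep covers every combination of sample size and shot setting, and that every model is refit for every seed in each configuration; neither is confirmed by this page, so read the count as an upper bound.

```python
from itertools import product

# Values taken from the classification sweep example in this section.
sample_sizes = [50, 100, 150]
shots_values = [None, 128]  # None is assumed to mean analytic execution
seeds = [0, 1, 2]
models = ["quantum_reservoir", "logistic_regression"]

# Assumption: one configuration per (sample size, shot setting) pair.
configurations = list(product(sample_sizes, shots_values))

# Assumption: each model is fit once per seed in every configuration.
total_fits = len(configurations) * len(models) * len(seeds)

for n_samples, shots in configurations:
    label = "analytic" if shots is None else f"{shots} shots"
    print(f"n_samples={n_samples}, {label}")
print("configurations:", len(configurations), "model fits:", total_fits)
```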
Paired Classical Comparison¶
paired_vs_best_classical compares each selected model against the best
included classical baseline on matching seeds. It reports:
- reference classical model
- seed-wise mean delta and confidence interval
- win/loss/tie counts
- number of paired seeds
This is the preferred summary for evaluating whether a QML model improves over classical references in the same benchmark call.
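The reported quantities can be illustrated with a small seed-matched comparison. This is a sketch under assumptions: the function name, the output keys, and the tie tolerance are hypothetical, and the confidence interval is left out because this page does not state how it is computed.

```python
from statistics import mean


def paired_deltas(model_scores, reference_scores, higher_is_better=True, tol=1e-12):
    """Compare two {seed: score} mappings on the seeds they share."""
    seeds = sorted(set(model_scores) & set(reference_scores))
    if not seeds:
        raise ValueError("no matching seeds to pair")
    deltas = [model_scores[s] - reference_scores[s] for s in seeds]
    # Flip the sign for lower-is-better metrics such as test_mse.
    sign = 1.0 if higher_is_better else -1.0
    wins = sum(1 for d in deltas if sign * d > tol)
    losses = sum(1 for d in deltas if sign * d < -tol)
    return {
        "mean_delta": mean(deltas),
        "wins": wins,
        "losses": losses,
        "ties": len(deltas) - wins - losses,
        "n_pairs": len(deltas),
    }


# Made-up test accuracies keyed by seed, for illustration only.
vqc_scores = {0: 0.90, 1: 0.85, 2: 0.88}
svm_scores = {0: 0.88, 1: 0.90, 2: 0.88}
comparison = paired_deltas(vqc_scores, svm_scores, higher_is_better=True)
print(comparison)
```

Pairing on seeds removes the variation that comes from the dataset draw and split, so the delta reflects the model difference rather than which seeds happened to be easy.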
Classical Hyperparameter Tuning¶
Classical baselines can be tuned with GridSearchCV over small default grids:
result = compare_classification_models(
    models=["quantum_kernel", "svm_classifier", "random_forest_classifier"],
    seeds=[0, 1, 2],
    tune_classical=True,
    cv=3,
)
Per-model grids can be overridden through model_kwargs:
result = compare_regression_models(
    models=["vqr", "kernel_ridge_regression", "svr_regression"],
    tune_classical=True,
    model_kwargs={
        "kernel_ridge_regression": {
            "param_grid": {
                "alpha": [0.01, 0.1, 1.0],
                "kernel": ["rbf"],
                "gamma": [0.1, 1.0],
            }
        }
    },
)
Quantum model tuning is supplied explicitly through model_kwargs, for example
by sweeping layers, optimizer steps, step size, batch_size, shots, or kernel
settings across separate benchmark calls.
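One way to organize such a sweep is to generate one model_kwargs dictionary per setting and issue one benchmark call for each. The sketch below uses hypothetical helpers and a stand-in benchmark function so that it runs without the package; in practice run_benchmark would be compare_classification_models or compare_regression_models. The n_layers key appears in the runtime-scaling example on this page, while passing shots as a model keyword is an assumption.

```python
from itertools import product


def sweep_model_kwargs(model, grid):
    """Yield one model_kwargs dict per combination of the grid values."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield {model: dict(zip(names, combo))}


def run_sweep(run_benchmark, model, grid, **shared):
    """Call the benchmark once per setting and keep (setting, result) pairs."""
    results = []
    for model_kwargs in sweep_model_kwargs(model, grid):
        result = run_benchmark(model_kwargs=model_kwargs, **shared)
        results.append((model_kwargs[model], result))
    return results


def fake_benchmark(models, seeds, model_kwargs):
    """Stand-in for a real benchmark call; it only echoes its inputs."""
    return {"models": models, "seeds": seeds, "model_kwargs": model_kwargs}


results = run_sweep(
    fake_benchmark,
    model="quantum_reservoir",
    grid={"n_layers": [1, 2, 3], "shots": [None, 128]},  # "shots" key is assumed
    models=["quantum_reservoir", "logistic_regression"],
    seeds=[0, 1, 2],
)
for setting, _ in results:
    print(setting)
```

Keeping the seeds and the classical baselines fixed across the calls keeps the per-setting results comparable.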
Datasets¶
Classification datasets:
moons
circles
blobs
xor
linear
breast_cancer
wine
Regression datasets:
linear
sine
polynomial
friedman
diabetes
CLI Benchmark Presets¶
Run benchmark summaries directly from the package command:
qml-pennylane benchmark classification \
    --models vqc quantum_kernel quantum_reservoir svm_classifier \
    --seeds 0 1 2 \
    --samples 100

qml-pennylane benchmark regression \
    --models vqr quantum_kernel_regressor quantum_reservoir_regressor ridge_regression \
    --seeds 0 1 2 \
    --samples 100
Finite-shot sweeps compare analytic execution against selected shot counts:
qml-pennylane benchmark finite-shots \
    --shots analytic 64 128 512 \
    --seeds 0 1 2 \
    --samples 70
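The CLI token analytic lines up with the None entry used in shots_values=[None, 128] in the Python API example. The following sketch shows how such tokens could be mapped to Python values; parse_shots is a hypothetical helper, and the package's actual argument parsing is not shown on this page.

```python
from typing import Optional


def parse_shots(token: str) -> Optional[int]:
    """Map a CLI shots token to a Python value (hypothetical helper).

    "analytic" becomes None; anything else must be a positive integer.
    """
    if token.strip().lower() == "analytic":
        return None
    shots = int(token)
    if shots <= 0:
        raise ValueError(f"shot count must be positive, got {shots}")
    return shots


# Tokens from the finite-shots preset in this section.
shots_values = [parse_shots(token) for token in ["analytic", "64", "128", "512"]]
print(shots_values)
```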
The benchmark notebook suite also includes a noise-model sweep that compares noiseless execution against depolarizing, amplitude-damping, readout-error, and combined low-noise settings for representative classification and regression QML workflows. Use it as a template for controlled hardware-error sensitivity studies by increasing seeds, sample counts, and channel probabilities.
The runtime-scaling preset sweeps sample counts and optional finite-shot settings:
qml-pennylane benchmark runtime-scaling \
    --task regression \
    --models quantum_reservoir_regressor ridge_regression \
    --sample-sizes 50 100 150 \
    --shots analytic
The real-data datasets (breast_cancer, wine, and diabetes) are projected to two features so they remain compatible with the compact quantum examples and visualizers. They are useful sanity checks, not substitutes for domain-specific benchmark suites.
CLI Usage¶
Classification benchmark:
python -m qml benchmark classification \
    --models vqc qcnn quantum_kernel svm_classifier random_forest_classifier \
    --seeds 123 456 789 \
    --tune-classical
Regression benchmark:
python -m qml benchmark regression \
    --models vqr ridge_regression kernel_ridge_regression svr_regression \
    --seeds 123 456 \
    --tune-classical
Default settings:
- samples: 200
- noise: 0.1
- test split: 0.25
- seed: 123
- classical tuning: disabled unless --tune-classical is provided
Saving Results¶
Results can be saved to:
results/benchmarks/
Saved JSON includes individual run records, aggregate metrics, timing summaries, paired classical comparisons, tuning metadata, environment metadata, and the dataset configuration.
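If you need to write a result dictionary yourself, a minimal sketch with the standard library looks like the following. The save_result helper and the file name are hypothetical, and the package may provide its own saving mechanism. The demo writes into a temporary directory rather than results/benchmarks/ so that it leaves no files behind.

```python
import json
import tempfile
from pathlib import Path


def save_result(result, directory="results/benchmarks", name="benchmark.json"):
    """Write a benchmark dictionary as indented JSON and return the file path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    with path.open("w", encoding="utf-8") as handle:
        # default=str is a fallback that stringifies values json cannot encode.
        json.dump(result, handle, indent=2, sort_keys=True, default=str)
    return path


# Trimmed-down stand-in for a real benchmark result.
example = {
    "benchmark_type": "classification",
    "dataset": "moons",
    "seeds": [0, 1, 2],
    "runs": [],
}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_result(example, directory=Path(tmp) / "results" / "benchmarks")
    reloaded = json.loads(saved.read_text(encoding="utf-8"))

print(saved.name, reloaded["benchmark_type"])
```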
Interpretation Checklist¶
Before publishing a benchmark table, record:
- model list and model-specific kwargs
- dataset name, sample count, split, noise level, and seed list
- analytic or finite-shot execution settings
- noise-model settings when noisy channels are enabled
- package, Python, scikit-learn, and PennyLane versions
- classical baselines and tuning grids
- mean, standard deviation, and confidence intervals
- paired deltas against the best classical baseline
- runtime totals and fit/predict breakdowns
- train/test generalization gaps
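The version items on this checklist can be captured programmatically when you assemble a table by hand. The sketch below uses only the standard library; the helper names and the distribution names passed to it are assumptions, and benchmark results already carry their own environment metadata.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_metadata(packages=("scikit-learn", "pennylane", "numpy")):
    """Collect interpreter, platform, and package versions for a report."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
    }


env = environment_metadata()
for name, version in env["packages"].items():
    print(name, version if version is not None else "not installed")
```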
Benchmarks in this package make comparisons reproducible and auditable. They do not establish quantum advantage by themselves.