# VQE Optimizers

This document summarizes the classical optimizers available through `vqe.optimizer` and used by the VQE workflows in this repository.

The implemented optimizer names are:

- `Adam`
- `GradientDescent`
- `Momentum`
- `NesterovMomentum`
- `RMSProp`
- `Adagrad`

These are exposed through the registry `OPTIMIZERS`, where each canonical name maps to:

- the PennyLane optimizer factory
- a calibrated default `stepsize`
- accepted aliases

The calibrated defaults currently used by `run_vqe()` when `stepsize` is omitted are:

- `Adam`: 0.15
- `GradientDescent`: 0.10
- `Momentum`: 0.10
- `NesterovMomentum`: 0.20
- `RMSProp`: 0.01
- `Adagrad`: 0.10

The main helpers are:

- `get_optimizer(name: str = "Adam", stepsize: float | None = None)`
- `get_optimizer_stepsize(name: str = "Adam") -> float`

and are used inside the main VQE loop via a unified PennyLane interface.
## Scope and Optimization Objective

In VQE, the classical optimization problem is

$$ \min_{\theta \in \mathbb{R}^d} E(\theta), $$

where

$$ E(\theta) = \langle \psi(\theta) | H | \psi(\theta) \rangle. $$

Definitions:

- $\theta = (\theta_1, \dots, \theta_d)^\top \in \mathbb{R}^d$ is the vector of variational circuit parameters
- $d$ is the number of trainable parameters in the ansatz
- $|\psi(\theta)\rangle$ is the parameterized trial state prepared by the ansatz
- $H$ is the qubit Hamiltonian for the chosen molecule, basis, and fermion-to-qubit mapping
- $E(\theta)$ is the VQE energy objective in Hartree (Ha)

At optimization step $t$, let:

- $\theta_t$ denote the current parameter vector
- $g_t = \nabla_\theta E(\theta_t)$ denote the gradient of the energy with respect to the parameters
- $\eta > 0$ denote the stepsize (learning rate), passed in this repository as `stepsize`

A generic first-order optimizer updates parameters according to

$$ \theta_{t+1} = \theta_t + \Delta \theta_t, $$

where the update $\Delta \theta_t$ depends on the optimizer-specific rule.
## Repository Context

In `vqe.core.run_vqe`, the optimizer is constructed as

```python
resolved_stepsize = (
    get_optimizer_stepsize(str(optimizer_name))
    if stepsize is None
    else float(stepsize)
)
opt = engine_build_optimizer(str(optimizer_name), stepsize=resolved_stepsize)
```

and then applied through either

```python
params, cost = opt.step_and_cost(energy_qnode, params)
```

or, if `step_and_cost` is unavailable,

```python
params = opt.step(energy_qnode, params)
```

So the role of the optimizer is purely classical:

- evaluate the VQE objective $E(\theta_t)$
- compute or use gradient information
- update the parameter vector $\theta_t \to \theta_{t+1}$

The quantum part prepares states and measures energies; the optimizer determines how the parameters move across iterations.
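The resolve-then-step pattern above can be mimicked without any quantum machinery. The sketch below is illustrative only: `ToyGD`, `energy`, and the finite-difference gradient are hypothetical stand-ins for the PennyLane optimizer and the energy QNode, chosen so the `step_and_cost` / `step` dispatch can be shown in isolation.

```python
# Illustrative sketch (assumed names, not repository code): a stand-in
# optimizer exposes step_and_cost so the dispatch pattern from run_vqe
# can be exercised on a classical toy objective.
class ToyGD:
    def __init__(self, stepsize):
        self.stepsize = stepsize

    def step_and_cost(self, cost_fn, params):
        # Forward finite-difference gradient, since there is no autograd here.
        eps = 1e-6
        grad = []
        for i in range(len(params)):
            shifted = list(params)
            shifted[i] += eps
            grad.append((cost_fn(shifted) - cost_fn(params)) / eps)
        new_params = [p - self.stepsize * g for p, g in zip(params, grad)]
        return new_params, cost_fn(params)

def energy(params):
    # Toy stand-in for the energy QNode: minimum at params = (1, 1).
    return sum((p - 1.0) ** 2 for p in params)

opt, params = ToyGD(0.1), [4.0, -2.0]
for _ in range(200):
    if hasattr(opt, "step_and_cost"):   # preferred path, as in run_vqe
        params, cost = opt.step_and_cost(energy, params)
    else:                               # fallback when only step() exists
        params = opt.step(energy, params)
```

The real loop delegates gradient computation to PennyLane's autodifferentiation of the QNode; the finite-difference gradient here is only a self-contained substitute.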
## Common Notation

The formulas below use the following symbols.

- $t \in \{0, 1, 2, \dots\}$: optimization iteration index
- $\theta_t \in \mathbb{R}^d$: parameter vector at iteration $t$
- $g_t = \nabla_\theta E(\theta_t) \in \mathbb{R}^d$: gradient at iteration $t$
- $\eta > 0$: global stepsize / learning rate
- $\odot$: elementwise (Hadamard) product
- $g_t^2$: elementwise square of the gradient vector
- $\sqrt{v_t}$: elementwise square root of a vector $v_t$
- $\epsilon > 0$: small numerical stabilizer to avoid division by zero
- $\beta, \beta_1, \beta_2 \in [0,1)$: decay or momentum hyperparameters
- $m_t \in \mathbb{R}^d$: first-moment / momentum accumulator
- $v_t \in \mathbb{R}^d$: second-moment / squared-gradient accumulator

Unless otherwise stated, vector divisions are elementwise.
## 1. Gradient Descent

### Update rule

Gradient Descent uses the negative gradient direction directly:

$$ \theta_{t+1} = \theta_t - \eta g_t. $$

### Variables

- $\theta_t$: current parameters
- $g_t$: gradient of the VQE energy at the current parameters
- $\eta$: fixed stepsize

### Interpretation

This is the simplest optimizer in the repository. It moves in the steepest local descent direction with a constant learning rate.

### Strengths

- simplest and most interpretable baseline
- useful for pedagogical comparisons
- minimal internal state

### Limitations

- highly sensitive to the choice of $\eta$
- can zig-zag in narrow valleys
- can be slow on ill-conditioned landscapes
- may stall or oscillate if the energy landscape has very different curvature scales across parameters

### Typical use in this repo

Gradient Descent is mainly a baseline for optimizer comparisons rather than the default production choice.
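The update rule can be demonstrated numerically on a toy objective. This is not repository code: $E(\theta) = (\theta - 1)^2$ and its gradient stand in for the VQE energy and its QNode gradient.

```python
# Minimal sketch: theta_{t+1} = theta_t - eta * g_t on the toy
# objective E(theta) = (theta - 1)^2, whose gradient is 2*(theta - 1).
def grad(theta):
    return 2.0 * (theta - 1.0)

eta = 0.1      # stepsize, analogous to the repository's `stepsize`
theta = 4.0    # initial parameter
for _ in range(100):
    theta -= eta * grad(theta)
# theta is now very close to the minimizer at 1.0
```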
## 2. Momentum

### Update rule

Momentum augments Gradient Descent with a velocity-like running average of past gradients. A standard form is

$$ m_t = \beta m_{t-1} + g_t, $$

$$ \theta_{t+1} = \theta_t - \eta m_t. $$

### Variables

- $m_t$: momentum accumulator at iteration $t$
- $m_{t-1}$: previous momentum accumulator
- $\beta \in [0,1)$: momentum coefficient
- $g_t$: current gradient
- $\eta$: stepsize

### Interpretation

Momentum smooths the update sequence by retaining part of the previous descent direction. This can accelerate motion along persistent downhill directions and suppress some oscillations.

### Strengths

- often faster than plain Gradient Descent
- can reduce oscillatory behaviour
- useful on elongated or shallow valleys

### Limitations

- still requires tuning of $\eta$
- can overshoot minima if momentum is too strong
- less robust than adaptive methods on heterogeneous parameter scales

### Typical use in this repo

Momentum is useful in optimizer-comparison studies where one wants to see whether simple inertial updates improve VQE convergence relative to plain Gradient Descent.
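The two-line update above can be sketched on the same toy quadratic used throughout; the value $\beta = 0.9$ is a conventional choice for illustration, not a repository default.

```python
# Minimal sketch: m_t = beta*m_{t-1} + g_t, theta_{t+1} = theta_t - eta*m_t,
# on the toy objective E(theta) = (theta - 1)^2.
def grad(theta):
    return 2.0 * (theta - 1.0)

eta, beta = 0.05, 0.9   # illustrative values, not repo defaults
theta, m = 4.0, 0.0
for _ in range(500):
    m = beta * m + grad(theta)   # accumulate velocity
    theta -= eta * m             # move along the smoothed direction
```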
## 3. Nesterov Momentum

### Update rule

Nesterov momentum is a momentum-based method with a look-ahead correction. A standard conceptual form is:

1. build a look-ahead point
   $$ \tilde{\theta}_t = \theta_t - \eta \beta m_{t-1} $$
2. evaluate the gradient at the look-ahead point
   $$ \tilde{g}_t = \nabla_\theta E(\tilde{\theta}_t) $$
3. update momentum and parameters
   $$ m_t = \beta m_{t-1} + \tilde{g}_t $$
   $$ \theta_{t+1} = \theta_t - \eta m_t $$

### Variables

- $\tilde{\theta}_t$: look-ahead parameter point
- $\tilde{g}_t$: gradient evaluated at the look-ahead point
- $m_t$: momentum accumulator
- $\beta$: momentum coefficient
- $\eta$: stepsize

### Interpretation

Instead of evaluating the gradient exactly at the current point, Nesterov momentum evaluates it after a partial extrapolation in the current momentum direction. This often gives a more anticipatory update.

### Strengths

- can converge faster than standard momentum in smooth problems
- often improves directional correction
- reduces some forms of overshooting relative to naive momentum

### Limitations

- behaviour still depends on step-size tuning
- benefit can be modest on noisy or irregular objective landscapes
- more difficult to reason about than plain Gradient Descent

### Typical use in this repo

NesterovMomentum is a useful intermediate option between simple momentum methods and more adaptive optimizers such as Adam.
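The three-step conceptual form above translates directly into code. As before, this is an illustrative sketch on a toy quadratic, not the PennyLane implementation, and the hyperparameter values are assumptions.

```python
# Minimal sketch: Nesterov momentum evaluates the gradient at a
# look-ahead point before updating, on E(theta) = (theta - 1)^2.
def grad(theta):
    return 2.0 * (theta - 1.0)

eta, beta = 0.05, 0.9   # illustrative values, not repo defaults
theta, m = 4.0, 0.0
for _ in range(500):
    look_ahead = theta - eta * beta * m   # tilde(theta)_t
    m = beta * m + grad(look_ahead)       # m_t uses the look-ahead gradient
    theta -= eta * m                      # theta_{t+1} = theta_t - eta * m_t
```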
## 4. Adagrad

### Update rule

Adagrad rescales each parameter update using the history of squared gradients:

$$ v_t = v_{t-1} + g_t^2, $$

$$ \theta_{t+1} = \theta_t - \eta \frac{g_t}{\sqrt{v_t} + \epsilon}. $$

### Variables

- $v_t$: accumulated elementwise sum of squared gradients up to step $t$
- $g_t^2$: elementwise square of the current gradient
- $\epsilon$: small positive stabilizer
- $\eta$: base stepsize

### Interpretation

Parameters that repeatedly experience large gradients receive smaller future effective step sizes, while parameters with smaller accumulated gradients receive relatively larger effective steps.

### Strengths

- automatically adapts per-parameter learning rates
- useful when different parameters evolve on very different scales
- can stabilize optimization early in training

### Limitations

- $v_t$ grows monotonically, so effective learning rates continually shrink
- may become overly conservative in longer optimization runs
- can stop making meaningful progress if the accumulated denominator becomes too large

### Typical use in this repo

Adagrad can be informative in VQE studies where parameter sensitivities are highly nonuniform, but it is usually not the first default choice for sustained optimization.
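The monotonically growing accumulator is easy to see in a one-dimensional sketch; the objective and hyperparameter values below are illustrative assumptions, not repository code.

```python
# Minimal sketch: Adagrad's v_t = v_{t-1} + g_t^2 accumulator only ever
# grows, shrinking the effective stepsize, on E(theta) = (theta - 1)^2.
def grad(theta):
    return 2.0 * (theta - 1.0)

eta, eps = 0.5, 1e-8    # illustrative values
theta, v = 4.0, 0.0
for _ in range(2000):
    g = grad(theta)
    v += g * g                           # monotonically growing accumulator
    theta -= eta * g / (v ** 0.5 + eps)  # per-parameter rescaled step
```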
## 5. RMSProp

### Update rule

RMSProp modifies Adagrad by replacing the cumulative squared-gradient sum with an exponentially weighted moving average:

$$ v_t = \beta v_{t-1} + (1-\beta) g_t^2, $$

$$ \theta_{t+1} = \theta_t - \eta \frac{g_t}{\sqrt{v_t} + \epsilon}. $$

### Variables

- $v_t$: exponential moving average of squared gradients
- $\beta \in [0,1)$: decay rate for the second-moment estimate
- $g_t$: current gradient
- $\eta$: base stepsize
- $\epsilon$: numerical stabilizer

### Interpretation

RMSProp keeps adaptive per-parameter scaling like Adagrad, but avoids the permanently shrinking learning-rate problem by discounting old gradient information.

### Strengths

- more stable than plain Gradient Descent on uneven landscapes
- often better behaved than Adagrad over longer runs
- adaptive scaling can help when gradients vary strongly across parameters

### Limitations

- still requires stepsize tuning
- can be less robust than Adam in practice
- does not include an explicit first-moment momentum term in the basic formulation

### Typical use in this repo

RMSProp is a reasonable adaptive alternative when one wants per-parameter learning-rate normalization without moving to the fuller Adam update.
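The only change from Adagrad is the discounted accumulator, which the sketch below isolates. Values are illustrative assumptions (though $\eta = 0.01$ happens to match the calibrated default listed earlier).

```python
# Minimal sketch: RMSProp replaces Adagrad's cumulative sum with
# v_t = beta*v_{t-1} + (1-beta)*g_t^2, on E(theta) = (theta - 1)^2.
def grad(theta):
    return 2.0 * (theta - 1.0)

eta, beta, eps = 0.01, 0.9, 1e-8    # illustrative values
theta, v = 4.0, 0.0
for _ in range(2000):
    g = grad(theta)
    v = beta * v + (1.0 - beta) * g * g   # discounted squared-gradient average
    theta -= eta * g / (v ** 0.5 + eps)
```

Because the denominator no longer grows without bound, the effective stepsize does not vanish over long runs; here the iterate settles into a small neighbourhood of the minimizer rather than stalling far from it.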
## 6. Adam

### Update rule

Adam combines momentum-like first-moment tracking with RMSProp-style second-moment adaptation.

First moment estimate:

$$ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t $$

Second moment estimate:

$$ v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 $$

Bias-corrected estimates:

$$ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} $$

Parameter update:

$$ \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}. $$

### Variables

- $m_t$: exponential moving average of gradients (first moment)
- $v_t$: exponential moving average of squared gradients (second moment)
- $\beta_1 \in [0,1)$: first-moment decay rate
- $\beta_2 \in [0,1)$: second-moment decay rate
- $\hat{m}_t$: bias-corrected first-moment estimate
- $\hat{v}_t$: bias-corrected second-moment estimate
- $\epsilon$: small stabilizer
- $\eta$: learning rate

### Interpretation

Adam combines three useful ideas:

- momentum-like smoothing through $m_t$
- adaptive per-parameter scaling through $v_t$
- bias correction to compensate for the zero initialization of the moment estimates at early iterations

### Strengths

- often robust across a wide range of VQE settings
- tends to work well without extensive tuning
- handles heterogeneous gradient scales better than plain Gradient Descent
- commonly a strong default for small and medium variational problems

### Limitations

- can sometimes converge to slightly less clean final minima than carefully tuned simpler methods
- still sensitive to overly aggressive stepsizes
- the most complex update rule among the implemented first-order optimizers here

### Typical use in this repo

Adam is the default optimizer in this repository and is generally the first choice for standard VQE runs.
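The full sequence of updates, including the $1/(1-\beta^t)$ bias correction, can be sketched on the same toy objective. The hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the conventional Adam defaults, used here only for illustration.

```python
# Minimal sketch of the Adam update on E(theta) = (theta - 1)^2;
# note the bias correction dividing by (1 - beta^t) at step t.
def grad(theta):
    return 2.0 * (theta - 1.0)

eta, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8   # illustrative values
theta, m, v = 4.0, 0.0, 0.0
for t in range(1, 3001):                    # t starts at 1 for bias correction
    g = grad(theta)
    m = b1 * m + (1.0 - b1) * g             # first-moment estimate
    v = b2 * v + (1.0 - b2) * g * g         # second-moment estimate
    m_hat = m / (1.0 - b1 ** t)             # bias-corrected first moment
    v_hat = v / (1.0 - b2 ** t)             # bias-corrected second moment
    theta -= eta * m_hat / (v_hat ** 0.5 + eps)
```

Without the bias correction, $m_1 = (1-\beta_1) g_1$ would be only a tenth of the true gradient, so early steps would be artificially small; dividing by $1 - \beta_1^t$ undoes exactly that shrinkage.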
## Practical Comparison

The optimizers in this repository can be grouped roughly as follows.

### Fixed-step methods

- `GradientDescent`
- `Momentum`
- `NesterovMomentum`

These methods use a global learning rate $\eta$ without adaptive per-parameter normalization. They are simple and interpretable, but usually more sensitive to hyperparameter tuning.

### Adaptive methods

- `Adagrad`
- `RMSProp`
- `Adam`

These methods rescale updates using gradient-history information. They are often more robust when the variational parameters have different sensitivities or when the local energy landscape is poorly conditioned.
## Which Optimizer Should Be Tried First?

For most standard VQE experiments in this repository:

- start with `Adam`
- use `GradientDescent` as a baseline if you want a clean reference
- try `Momentum` or `NesterovMomentum` if you want simple inertial alternatives
- try `RMSProp` or `Adagrad` when parameter scales appear uneven

A practical rule of thumb is:

- `Adam` for general-purpose default use
- `GradientDescent` for interpretability and baseline studies
- `Momentum` / `NesterovMomentum` for simple acceleration over GD
- `RMSProp` / `Adagrad` for stronger per-parameter adaptation
## Relation to Noise and Ansatz Choice

The optimizer acts on the classical objective produced by the quantum circuit, so its behaviour depends indirectly on:

- the ansatz family
- the number of trainable parameters
- the molecular Hamiltonian
- the fermion-to-qubit mapping
- whether the circuit execution is noiseless or noisy

In particular:

- a more expressive ansatz may give a lower reachable minimum but a harder optimization landscape
- noisy execution can distort gradients or make convergence less smooth
- different mappings can change the Pauli structure and therefore the measured objective landscape
- different optimizers may respond differently to the same VQE instance

This is why the repository includes dedicated optimizer-comparison workflows such as `run_vqe_optimizer_comparison`.
## Notes on Exact Hyperparameters

This repository passes only the user-facing `stepsize` explicitly through the optimizer factory, via `get_optimizer(name, stepsize)`. Other internal optimizer hyperparameters (e.g. momentum or decay coefficients) are not exposed in this repository and are inherited directly from the underlying PennyLane implementations.

So:

- the formulas in this document describe the mathematical update structure
- the concrete default hyperparameter values beyond `stepsize` are determined by PennyLane unless the implementation is extended in the future
## Implemented Name Mapping

The current optimizer registry is:

```python
OPTIMIZERS = {
    "Adam": {"factory": qml.AdamOptimizer, "stepsize": 0.15, "aliases": ("adam",)},
    "GradientDescent": {
        "factory": qml.GradientDescentOptimizer,
        "stepsize": 0.10,
        "aliases": ("gradientdescent", "gradient_descent", "gd"),
    },
    "Momentum": {
        "factory": qml.MomentumOptimizer,
        "stepsize": 0.10,
        "aliases": ("momentum",),
    },
    "NesterovMomentum": {
        "factory": qml.NesterovMomentumOptimizer,
        "stepsize": 0.20,
        "aliases": ("nesterov", "nesterovmomentum"),
    },
    "RMSProp": {"factory": qml.RMSPropOptimizer, "stepsize": 0.01, "aliases": ("rmsprop",)},
    "Adagrad": {"factory": qml.AdagradOptimizer, "stepsize": 0.10, "aliases": ("adagrad",)},
}
```

So the canonical user-facing names are:

- `Adam`
- `GradientDescent`
- `Momentum`
- `NesterovMomentum`
- `RMSProp`
- `Adagrad`

Accepted aliases also include, for example, `adam`, `gd`, and `nesterov`.
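How a name or alias resolves against such a registry can be sketched as follows. This is a hypothetical re-implementation for illustration only: the `factory` entries are omitted (they require PennyLane), and `resolve_name` is an assumed helper, not the repository's actual lookup code, though the names, defaults, and aliases mirror the registry above.

```python
# Hypothetical alias-resolution sketch mirroring the registry's names,
# calibrated stepsizes, and aliases (factories omitted for portability).
OPTIMIZERS = {
    "Adam": {"stepsize": 0.15, "aliases": ("adam",)},
    "GradientDescent": {"stepsize": 0.10, "aliases": ("gradientdescent", "gradient_descent", "gd")},
    "Momentum": {"stepsize": 0.10, "aliases": ("momentum",)},
    "NesterovMomentum": {"stepsize": 0.20, "aliases": ("nesterov", "nesterovmomentum")},
    "RMSProp": {"stepsize": 0.01, "aliases": ("rmsprop",)},
    "Adagrad": {"stepsize": 0.10, "aliases": ("adagrad",)},
}

def resolve_name(name):
    """Map a user-supplied name or alias to its canonical registry key."""
    key = name.strip()
    for canonical, spec in OPTIMIZERS.items():
        if key == canonical or key.lower() in spec["aliases"]:
            return canonical
    raise ValueError(f"Unknown optimizer name: {name!r}")
```

For example, `resolve_name("gd")` maps to `"GradientDescent"`, after which the calibrated default `stepsize` can be read from the matching entry.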
## Summary Table

| Optimizer | Core idea | Uses momentum? | Adaptive per-parameter scaling? | Main trade-off |
|---|---|---|---|---|
| `GradientDescent` | direct negative-gradient update | No | No | simple but stepsize-sensitive |
| `Momentum` | gradient descent with velocity accumulation | Yes | No | faster than GD, can overshoot |
| `NesterovMomentum` | momentum with look-ahead gradient | Yes | No | often sharper updates, still tuning-sensitive |
| `Adagrad` | cumulative squared-gradient scaling | No | Yes | adapts well early, can become too conservative |
| `RMSProp` | moving-average squared-gradient scaling | No | Yes | adaptive and stable, but less full-featured than Adam |
| `Adam` | momentum + adaptive second moment + bias correction | Yes | Yes | robust default, but more complex |
## See Also

- `THEORY.md` — high-level VQE theory and algorithm context
- `vqe/optimizer.py` — optimizer factory and supported names
- `vqe/core.py` — main VQE loop using the optimizer
- `USAGE.md` — practical usage and CLI examples