GPU Porting Guide
The current code runs on the CPU Aer statevector simulator.
The NREL HPC module qiskit/aer-gpu exposes AerSimulator with device='GPU'
using NVIDIA cuStateVec (cuQuantum) for GPU-accelerated simulation.
Three-Tier Strategy
Tier 1: Drop-in backend swap (immediate, ~1 line per location)
Replace every Aer.get_backend(...) call:
# BEFORE
from qiskit_aer import Aer
simulator = Aer.get_backend('statevector_simulator')
# or
simulator = Aer.get_backend('aer_simulator_statevector')
# AFTER
from qiskit_aer import AerSimulator
simulator = AerSimulator(method='statevector', device='GPU')
Locations marked with # GPU-SWAP TIER 1 in the codebase:
- qiskit_impl/binary_optimizer.py → execute_optimizer(), execute_optimizer_oracle(), execute_qae()
- QuantumExpectedValueFunctionProject/dense_optimizer.py → all solveAnnealing*() methods
Expected speedup: 5–50× for circuits with \(n \geq 20\) qubits. Risk: zero, since this is a pure backend swap.
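The same scripts may also need to run on nodes where no GPU is visible. A minimal sketch of a guarded Tier 1 swap follows. `pick_device` and `make_simulator` are hypothetical helper names, not part of the codebase, and the sketch assumes the installed qiskit-aer build reports its devices through `AerSimulator.available_devices()`.

```python
def pick_device(available_devices, prefer_gpu=True):
    """Return 'GPU' when it is both wanted and available, otherwise 'CPU'."""
    if prefer_gpu and "GPU" in available_devices:
        return "GPU"
    return "CPU"


def make_simulator(prefer_gpu=True):
    """Build a statevector AerSimulator, falling back to the CPU (hypothetical helper)."""
    # Imported lazily so that pick_device() stays usable without qiskit-aer installed.
    from qiskit_aer import AerSimulator

    devices = AerSimulator().available_devices()  # e.g. ('CPU',) or ('CPU', 'GPU')
    return AerSimulator(method="statevector", device=pick_device(devices, prefer_gpu))
```

With such a helper, each location marked # GPU-SWAP TIER 1 could call make_simulator() instead of hard-coding device='GPU', which keeps CPU-only runs working for the benchmarking comparison.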
Tier 2: Transpile once + ParameterVector binding
Inside DQA time-step loops, sub-circuits are currently rebuilt with numeric angles each iteration.
The fix, demonstrated in bayesianQC/optimize_10epoch_performance.py, is to build and transpile a symbolic template once and bind only numeric values inside the loop:
from qiskit.circuit import Parameter
from qiskit.transpiler.preset_passmanagers import generate_preset_pass_manager

# Build the symbolic template ONCE
s = Parameter('s')
cost_tmpl = cost_operator_symbolic(s, ...)

# Transpile ONCE to the GPU target (gpu_simulator is the Tier 1 AerSimulator)
pm = generate_preset_pass_manager(optimization_level=1, backend=gpu_simulator)
cost_t = pm.run(cost_tmpl)

# Inside the loop: only bind numbers (cheap)
for t in range(time_steps):
    s_val = (dt * t) / time
    # compose() inlines the bound gates instead of wrapping them in an opaque instruction
    qc.compose(cost_t.assign_parameters({s: s_val}), qubits=qubits, inplace=True)
Locations marked with # PARAM-TRANSPILE in the codebase.
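As a self-contained illustration of the pattern above, the sketch below swaps the real cost operator for a toy RZ layer and runs on the CPU backend so it can be tried without a GPU. `anneal_schedule` and `build_dqa_circuit` are hypothetical names, and the RZ layer is only a stand-in for cost_operator_symbolic, whose actual form is not shown here.

```python
def anneal_schedule(time_steps, dt, total_time):
    """Schedule values s_t = dt * t / total_time, as used in the time-step loop."""
    return [(dt * t) / total_time for t in range(time_steps)]


def build_dqa_circuit(n_qubits, time_steps, dt, total_time):
    """Transpile a one-parameter template once, then bind it per time step."""
    # Lazy imports keep anneal_schedule() usable without Qiskit installed.
    from qiskit import QuantumCircuit
    from qiskit.circuit import Parameter
    from qiskit.transpiler.preset_passmanagers import generate_preset_pass_manager
    from qiskit_aer import AerSimulator

    backend = AerSimulator(method="statevector")  # CPU here; add device='GPU' on a GPU node

    # Toy stand-in for the symbolic cost operator (assumption, not the project's operator).
    s = Parameter("s")
    template = QuantumCircuit(n_qubits)
    for q in range(n_qubits):
        template.rz(2 * s, q)

    # Transpile the template a single time.
    pm = generate_preset_pass_manager(optimization_level=1, backend=backend)
    template_t = pm.run(template)

    # Per step, only bind a number and inline the bound gates.
    qc = QuantumCircuit(n_qubits)
    for s_val in anneal_schedule(time_steps, dt, total_time):
        qc.compose(template_t.assign_parameters({s: s_val}), inplace=True)
    return qc
```

Note that with range(time_steps) the schedule starts at 0 and stops one step short of dt * time_steps / total_time, matching the loop shown earlier.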
Tier 3: cuStateVec (>20 qubits)
simulator = AerSimulator(
    method='statevector', device='GPU',
    cuStateVec_enable=True,   # NVIDIA cuStateVec via cuquantum-cu12
    blocking_enable=True,     # auto-chunk if VRAM insufficient
    blocking_qubits=23,
)
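To judge when chunking starts to matter, it helps to compare the statevector size against the available VRAM. The sketch below assumes double precision (one complex128 amplitude = 16 bytes), so an n-qubit statevector needs \(2^n \times 16\) bytes. The helper names and the 40 GiB figure in the usage note are illustrative assumptions, not properties of the NREL nodes.

```python
BYTES_PER_AMPLITUDE = {"double": 16, "single": 8}  # complex128 / complex64


def statevector_bytes(n_qubits, precision="double"):
    """Memory needed for a full statevector of n_qubits qubits."""
    return (2 ** n_qubits) * BYTES_PER_AMPLITUDE[precision]


def fits_in_memory(n_qubits, memory_bytes, precision="double"):
    """True if the whole statevector fits into the given amount of memory."""
    return statevector_bytes(n_qubits, precision) <= memory_bytes


def format_gib(n_bytes):
    """Human-readable size in GiB."""
    return f"{n_bytes / 1024 ** 3:.3f} GiB"
```

Under these assumptions, a 30-qubit statevector takes 16 GiB and a 32-qubit one takes 64 GiB, so fits_in_memory(32, 40 * 1024**3) is False for a hypothetical 40 GiB device. A chunk of blocking_qubits=23 qubits corresponds to 128 MiB.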
Benchmarking Plan
- Run brute_force_energy_surface() as classical baseline
- For n_y in [3, 4, 5, 6, 7, 8]: time execute_optimizer() on CPU vs GPU
- For m_qae in [3, 4, 5, 6, 7]: time execute_qae() on CPU vs GPU
- Plot speedup vs n_qubits and vs circuit depth
- Report: max speedup, crossover point (where GPU beats CPU)
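The timing and reporting steps of this plan can be sketched with a small harness. The helper names below are hypothetical, and how execute_optimizer() or execute_qae() receive their backend is an assumption left to the caller, who wraps each run in a zero-argument callable.

```python
import time


def best_time(fn, repeats=3):
    """Best wall-clock time in seconds over several calls of a zero-argument callable."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedups(cpu_times, gpu_times):
    """Per-size speedup, defined as CPU time divided by GPU time."""
    return [c / g for c, g in zip(cpu_times, gpu_times)]


def crossover_point(sizes, cpu_times, gpu_times):
    """Smallest problem size at which the GPU run is faster, or None if it never is."""
    for n, c, g in zip(sizes, cpu_times, gpu_times):
        if g < c:
            return n
    return None
```

For example, cpu_times could be collected as [best_time(lambda: run_on_cpu(n)) for n in sizes], where run_on_cpu is a placeholder for a wrapper around the function under test. Taking the best of several repeats reduces the influence of one-off costs such as GPU context start-up.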