Speed up and simplify SR
Created by: inailuig
This implements the non-centered Sv = ⟨Ô†ΔÔ⟩v with just 1 jvp + 1 vjp, instead of 1 jvp + 2 vjp, all while doing the same amount of communication.
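For readers unfamiliar with the trick, here is a minimal single-process sketch of how a product of the form ⟨Ô†ΔÔ⟩v can be evaluated with one jvp and one vjp. It is not the code from this PR: it reads ΔÔ as Ô − ⟨Ô⟩, assumes real parameters and a real-valued log-amplitude, ignores the communication step, and the names `log_psi` and `sr_matvec` are hypothetical.

```python
import jax
import jax.numpy as jnp


def log_psi(params, x):
    # Hypothetical toy ansatz (an assumption for illustration only):
    # real parameters, one real output value per sample.
    return jnp.tanh(x @ params["W"]).sum(axis=-1) + x @ params["b"]


def sr_matvec(params, v, samples):
    """Sketch of S v = <O^T (O - <O>)> v using 1 jvp + 1 vjp."""
    n = samples.shape[0]
    f = lambda p: log_psi(p, samples)

    # 1 jvp: w_i = sum_k O_ik v_k, where O is the per-sample Jacobian.
    _, w = jax.jvp(f, (params,), (v,))

    # Centering happens on the length-n vector instead of on O itself:
    # (O - <O>) v = O v - mean(O v).
    w = w - w.mean()

    # 1 vjp: result_k = (1/n) sum_i O_ik w_i.
    _, vjp_fun = jax.vjp(f, params)
    (res,) = vjp_fun(w / n)
    return res
```

Because the mean is subtracted from the intermediate vector rather than from the Jacobian, no second vjp is needed to compute ⟨Ô⟩ separately. In a distributed setting the `w.mean()` call would presumably become a reduction across processes, which this sketch leaves out.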
This results in a measurable speedup over both the previous implementation and the centered version (benchmarking just the matrix-vector product).
Also fixes and simplifies a few things and extends the test.
A derivation is coming soon™.