* Twelve rounds of G mixing as part of BLAKE2b compression step. Normally,
* each round is divided into eight subprocesses; NanoPow compresses these
* operations into four subprocesses by executing sequential pairs
- * simultaneously, inspired by https://github.com/minio/blake2b-simd
+ * simultaneously, inspired by https://github.com/minio/blake2b-simd. It then
+ * executes each compressed statement in pairs so that the compiler can
+ * interleave independent instructions and improve scheduling. For example,
+ * to execute `a = a + b` for subprocesses 1-4, subprocess 1 is paired with 2
+ * and 3 with 4; the statement is executed for pair 1/2 and then for pair 3/4
+ * before moving on to the next computation, `a = a + m[sigma[r][2*i+0]]`,
+ * which is executed in the same manner, and so on through all the steps of
+ * the subprocess.
*
* Each subprocess applies transformations to `m` and `v` variables based on
* a defined set of index inputs. The algorithm for each subprocess is defined
* Each sum step has an extra carry addition. Note that the m[sigma] sum is
* skipped if m[sigma] is zero since it effectively does nothing.
*/
- var a: vec4<u32>;
- var b: vec4<u32>;
- var c: vec4<u32>;
- var d: vec4<u32>;
- var x: vec4<u32>;
- var y: vec4<u32>;
var v56: vec4<u32>;
var vFC: vec4<u32>;
var v74: vec4<u32>;