This might take a lot of code, but may be
the best way to optimize all cases in combination with a deeply pipelined loop.

A computed jump into the middle of the loop, thus making the first iteration
handle the excess. This should make times smoothly increase with size, which
is attractive, but setups for the jump and adjustments for pointers can be
tricky and could become quite difficult in combination with deep pipelining.
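
The computed-jump idea can be sketched in C, where a `switch` whose case labels sit inside the loop body (the construct usually called Duff's device) plays the part of the jump. This is only an illustration, not code from any particular library: the `limb_t` type, the function name `com_n_unrolled4`, the 4-limb unrolling and the choice of a limb-wise complement as the operation are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical limb type, assumed for this sketch */

/* Store the one's complement of src[0..n-1] into dst[0..n-1], using a
   4-limb unrolled loop that is entered in the middle on its first pass. */
static void com_n_unrolled4(limb_t *dst, const limb_t *src, size_t n)
{
    if (n == 0)
        return;

    /* Total passes through the loop, counting the partial first pass. */
    size_t passes = (n + 3) / 4;

    /* The switch acts as the computed jump: the first pass handles the
       n % 4 excess limbs, every later pass handles a full 4 limbs. */
    switch (n % 4) {
    case 0: do { *dst++ = ~*src++;
    case 3:      *dst++ = ~*src++;
    case 2:      *dst++ = ~*src++;
    case 1:      *dst++ = ~*src++;
            } while (--passes > 0);
    }
}
```

Because this sketch advances both pointers after every limb, it needs no pointer adjustment at entry. An assembly loop would more likely address the limbs at fixed offsets from pointers that advance once per pass, and it is there that the pointers presumably have to be biased to match the entry point, which is the tricky setup described above.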