I would like some help understanding what's going on here.
A couple of weeks ago I completed work on the kernels themselves, but due to other priorities I only had the chance to integrate them and test them against a remote server yesterday.
I have some further ideas on how to improve my implementation, but the results so far are interesting, and on my hardware (Capeverde) I have a sizeable advantage, so I'll stop here for a while.
Yescrypt is supposed to be GPU resistant by requiring a ton and a half of bandwidth and being latency-bound. At its inner core, it does something like:
The above is the "parallel wide transform" of an 8-ulong block, called Block_pwxform. As far as I am concerned it should really be called Slice_pwxform, but that's how it is.
Since there is basically no ALU intensity, I expected this to be horribly slow. Indeed it is, compared to a high-end CPU, but the readings (as reported by CodeXL) are still not what I would expect.
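For reference, the per-slice transform can be sketched in plain C roughly as follows. This is only a sketch, not the kernel code from this post: the parameter values (2 lanes per gather, 4 gathers, 6 rounds, 256-entry S-boxes) are assumed from the yescrypt reference code of that era, and the names `slice_pwxform`, `S_MASK` and `S_HALF_BYTES` are made up for illustration. It shows why the operation is latency-bound: every round needs two S-box reads whose addresses depend on the result of the previous round.

```c
#include <stdint.h>
#include <string.h>

/* Assumed parameters (yescrypt reference of that era, not taken from the post). */
enum {
    PWX_SIMPLE   = 2,  /* 64-bit lanes per gather */
    PWX_GATHER   = 4,  /* gathers per slice: 4 * 2 = 8 ulongs */
    PWX_ROUNDS   = 6,
    S_WIDTH      = 8,  /* log2 of the number of entries per S-box */
    S_MASK       = ((1 << S_WIDTH) - 1) * PWX_SIMPLE * 8, /* byte-offset mask, 0xFF0 */
    S_HALF_BYTES = (1 << S_WIDTH) * PWX_SIMPLE * 8        /* 4096 bytes per S-box */
};

/* Transform one slice of 8 ulongs in place.
 * s must point to 2 * S_HALF_BYTES bytes holding the two S-boxes back to back. */
static void slice_pwxform(uint64_t x[PWX_GATHER * PWX_SIMPLE], const uint8_t *s)
{
    const uint8_t *s0 = s;
    const uint8_t *s1 = s + S_HALF_BYTES;

    for (int round = 0; round < PWX_ROUNDS; round++) {
        for (int j = 0; j < PWX_GATHER; j++) {
            uint64_t *lane = &x[j * PWX_SIMPLE];

            /* The lookup addresses come from the data itself,
             * so they cannot be prefetched ahead of time. */
            uint32_t lo = (uint32_t)lane[0];
            uint32_t hi = (uint32_t)(lane[0] >> 32);
            const uint8_t *p0 = s0 + (lo & S_MASK);
            const uint8_t *p1 = s1 + (hi & S_MASK);

            for (int k = 0; k < PWX_SIMPLE; k++) {
                uint64_t a, b, v = lane[k];
                memcpy(&a, p0 + 8 * k, sizeof a);
                memcpy(&b, p1 + 8 * k, sizeof b);
                /* 32x32 -> 64 multiply of the two halves, add, then XOR. */
                lane[k] = (((v >> 32) * (uint64_t)(uint32_t)v) + a) ^ b;
            }
        }
    }
}
```

The ALU work per round is a single multiply, add and XOR per lane, which is what makes the low ALU intensity visible.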
How is the GPU hiding latency here? In theory there shouldn't be so much stuff to do.
One of my theories is that the VALU activity comes from the Salsa20_8 operation. Let me elaborate.
The above Block_pwxform is run sequentially over 128 consecutive slices (of 8 ulongs each); this operation is Blockmix_pwxform. It is strictly sequential, as slice n+1 needs to be XORed with slice n as it results after the S-box manipulation. After all the slices have been processed, the last slice gets Salsa'd.
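The sequential dependency can be sketched in plain C like this. It is a minimal illustration of the data flow only: `blockmix_chain` and the two `toy_*` functions are hypothetical names, the toy functions are placeholders standing in for the real slice transform and for Salsa20_8, and seeding the chain from the last slice is an assumption taken from the yescrypt reference code rather than from this post.

```c
#include <stddef.h>
#include <stdint.h>

#define SLICE_WORDS 8 /* 8 ulongs per slice, as in the post */

typedef void (*slice_fn)(uint64_t slice[SLICE_WORDS]);

/* Placeholder for the S-box transform: NOT the real pwxform. */
static void toy_transform(uint64_t slice[SLICE_WORDS])
{
    for (int w = 0; w < SLICE_WORDS; w++)
        slice[w] += 1;
}

/* Placeholder for Salsa20_8: NOT the real permutation. */
static void toy_final_mix(uint64_t slice[SLICE_WORDS])
{
    for (int w = 0; w < SLICE_WORDS; w++)
        slice[w] = ~slice[w];
}

/* Run the chain over `slices` consecutive slices stored in `block`. */
static void blockmix_chain(uint64_t *block, size_t slices,
                           slice_fn transform, slice_fn final_mix)
{
    uint64_t x[SLICE_WORDS];
    uint64_t *last = block + (slices - 1) * SLICE_WORDS;

    /* Assumption: the running state starts as a copy of the last slice. */
    for (int w = 0; w < SLICE_WORDS; w++)
        x[w] = last[w];

    for (size_t i = 0; i < slices; i++) {
        uint64_t *cur = block + i * SLICE_WORDS;

        /* Slice i is mixed with the transformed result of slice i - 1,
         * so iteration i cannot start before iteration i - 1 finishes. */
        if (slices > 1)
            for (int w = 0; w < SLICE_WORDS; w++)
                x[w] ^= cur[w];

        transform(x);

        for (int w = 0; w < SLICE_WORDS; w++)
            cur[w] = x[w];
    }

    /* Only the last slice goes through the final mix. */
    final_mix(last);
}
```

With the 128 slices mentioned above, a call would look like `blockmix_chain(block, 128, toy_transform, toy_final_mix)`, with the real transforms substituted for the toy ones. The only parallelism left is across the 8 ulongs inside a slice, not across slices.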
In my implementation, I used the 64 WIs of a wavefront to load up bytes of the various consecutive ulong pairs, dispatching 64 WIs for each "true" element I'm processing. The Salsa, instead, is driven from a constant expression. My intention was to have the compiler ideally shut down the VALU and go SALU completely.
Perhaps this does not happen because I use LDS extensively here. I'd be glad if anyone could try this on Tonga; perhaps the SALU will trigger there.
AFAIK, the VALU cannot really be shut down, and because of the way GCN works it's not like it can switch to another wavefront in the meantime, but I assume there's a way to save some wattage by using some dummy VALU instruction (I think I've only seen S_NOP in the disassembly).
So... nothing really. I am asking myself questions about what turned out to be quite an interesting experiment. Opinions welcome. I decided to post it here after the AMD blog posts about hash trees from a couple of months ago.
Final notes:
Edit: added note 4.