I would like to get some help in understanding what's going on there.
A couple of weeks ago I completed work on the kernels themselves but due to other priorities I only had the chance to integrate and test them to a remote server only yesterday.
I have some further ideas on how to improve my implementation but the results I found so far are interesting and on my hardware (Capeverde) I have a sizeable advantage so I'll stop here for a while.
Yescrypt is supposed to be GPU resistant by requiring a ton and half of bandwidth and being latency-bound. At its inner core, it does something like:
- Given a blob of data (S, being an "s-box") taking 8KiB in this case;
- Derive two pointers ubyte *s0, *s1, being 4KiB away from each other, I have found references around about those being far to force decreased cache efficiency being one page away;
- For each pair of ulongs x0, x1, taken consecutively from the 'state slice' being processed (8 ulongs per slice), do (this is a for, but I do it in parallel in my kernels)
- Derive from the first ulong two indices i0, i1.
- Load 2x2 ulongs, the first pair starting from s0+i0, the second from s1+i1. Yes, it is unaligned access but I load them as ubytes so there should be no problem for me right?
- Those loaded ulongs are xorred or added to x0, x1.
The above is the "parallel wide transform" of a "8 ulongs block" being Block_pwxform, which should really be called Slice_pwxform as far as I am concerned but that's it.
Being there basically no ALU intensity, I expected this to be horribly slow. Indeed, it is, compared to an high-end CPU but the readings (as by CodeXL) are still not what I would expect.
- As a start, I have measured VALUBusy > 70% where I would have expected no more than half that number, in the best case, this kernel gets occupancy 50%. Is GCN really so cool it can get along with just an handful of ALU instructions? It was my understanding I needed hundreds of them... and indeed I have thousands of instructions but I haven't had the chance to dive in ASM yet.
- SALUBusy by contrast is a lot lower than I expected (~2%). More on that below. It appears the compiler is just too conservative in using the VALU. What pattern can I use to help the poor thing understand?
- Even though I use at worse 60% MemUnitBusy with read and write stalls being close to zero (in the 90% time kernels), I consume minuscule amounts of bandwidth. I'd say that's just kernel design working as expected (we have been fetching bumpy-shiny RGBA32 texels for a while after all ) but if the memory units are not saturated... shouldn't I be reaching higher perf? I still suspect I don't properly understand CodeXL readings (do I have to cross-post in CodeXL section?). I was expecting this to be >90% in at least Busy.
How is the GPU hiding latency here? In theory there shouldn't be so much stuff to do.
One of my theories is that the VALU is using the Salsa20_8 operation. Let me elaborate.
The above Block_pwxform is run sequentially over 128 consecutive slices (of 8 ulongs each), this operation being Blockmix_pwxform. This is strictly sequential, as each n+1 slice needs to be xorred with the nth as resulted after the S-box manipulation. After all the slices are gone, the last slice gets Salsa'd.
In my implementation, I used the 64WIs of a wavefront to load up bytes of the various consecutive ulong pairs, dispatching 64WIs for each "true" element I'm processing. This salsa instead is driven from a constant expression. My intention was to have the compiler ideally shut down the VALU and go SALU completely.
Perhaps this does not happen as I use LDS extensively here. I'd be glad if anyone could try this on Tonga, perhaps the SALU will trigger.
AFAIK, the VALU cannot be really shut down... and because of the way GCN works it's not like it can switch to another wavefront in the meanwhile but I assume there's a way to save some wattage by using some dummy VALU instruction (I think I've only seen S_NOP in the disassembly).
So... nothing really. I am doing myself questions on what it turned out a quite interesting experiment. Opinions welcome. I have decided to post it there after AMD blog posts about hash trees of a couple of months ago.
- I am still doing some adjustments, the public version should be build-able in a matter of days.
- If you want to take a look at how the buffers are sized and how the kernels are used, I have a data-descriptive language in place you can probably guess.
- The access patterns used are more or less as the CPU would go with. I haven't tried experimenting with them yet. I don't even think there's much I can do in the performance path. I don't even think it matters as the wavefront is more or less packed. Opinions welcome!
- I suspect I could be wasting a bazillion instructions in just LDS addressing.
Edit: added note 4.