Here is the suggested response from the relevant team:
“One wavefront is serviced at a time (over some number of clocks), so it’s best if wavefronts fetch one or more entire cachelines to get peak L1$ bandwidth.”
Does "some number of clocks" mean a variable number, or is it a fixed number (say 4) and they just aren't being specific?
If it's variable, does that mean all (up to 64) memory reads that have returned from the L2 will be serviced before another wavefront is serviced?
In other words, if a wavefront has executed FLAT_LOAD_DWORD, and each of the 64 threads is loading from a different random address in memory, will it take 64 consecutive cycles to service that wavefront?
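
To make the two cases concrete, here is a rough sketch that counts how many distinct cache lines a single wavefront's dword loads touch. The 64-byte line size and the helper name `cachelines_touched` are my assumptions, not something stated in the reply, and the sketch says nothing about how many clocks each line actually costs, which is exactly what I'm asking.

```python
import random

CACHELINE_BYTES = 64   # ASSUMPTION: L1$ line size; not given in the quoted reply
WAVEFRONT_SIZE = 64    # threads per wavefront, as in the question above
DWORD_BYTES = 4        # FLAT_LOAD_DWORD loads 4 bytes per thread


def cachelines_touched(addresses, line_bytes=CACHELINE_BYTES):
    """Count the distinct cache lines covered by one dword load per address."""
    lines = set()
    for addr in addresses:
        # Record the line of the first and last byte, in case an
        # unaligned dword straddles a line boundary.
        lines.add(addr // line_bytes)
        lines.add((addr + DWORD_BYTES - 1) // line_bytes)
    return len(lines)


# Coalesced case: thread i loads dword i of a contiguous, line-aligned buffer.
coalesced = [0x1000 + DWORD_BYTES * i for i in range(WAVEFRONT_SIZE)]

# Scattered case: each thread loads a dword-aligned address from a large range.
rng = random.Random(0)
scattered = [DWORD_BYTES * rng.randrange(1 << 28) for _ in range(WAVEFRONT_SIZE)]

print("coalesced lines:", cachelines_touched(coalesced))  # 256 bytes / 64 = 4
print("scattered lines:", cachelines_touched(scattered))  # at most 64
```

If the service time scales with the number of distinct lines, the coalesced wavefront would need 4 line fetches and the scattered one up to 64, but that scaling is the part I'd like confirmed.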