Preface: I haven't read the GCN instruction manual yet. I'm starting to believe it's a must, but the sheer size of the document scares me a bit, and I don't plan to use IL anyway; I intend to stay on CL only.
I have a kernel which cannot saturate memory bandwidth yet produces a very high ALUBusy%, on the order of 90%, with the remaining 10% likely lost to occupancy issues. I noticed that many operations in this kernel are really independent 4D operations, so I reformulated each work item as a set of 4 work items collaborating through LDS in a 4x16 work group, which I need to match the wavefront size to exploit free barriers. From now on, I'll refer to the 4 work items making up an "original" WI as a team.
It is worth noting that each WI in the team works on one "channel" of the data, but each channel consists of eight successive elements, so each WI has to iterate with a stride of 4 uints to load each successive value it will process.
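To make the layout concrete, here is a minimal sketch of the access pattern I ended up with. The kernel name, buffer types, and the xor "work" are placeholders, not my real kernel; the point is the indexing: at each iteration the 4 WIs of a team touch 4 consecutive uints, while each individual WI strides by 4.

```c
// Sketch only: hypothetical names, placeholder computation.
// local id x = channel within the team (0..3), dimension y selects the team.
__kernel void sw_sketch(__global const uint *in, __global uint *out)
{
    const uint c    = get_local_id(0);     // channel within the team, 0..3
    const uint team = get_global_id(1);    // one team per "original" WI
    const uint base = team * 32;           // 4 channels * 8 elements per team

    uint acc = 0;
    for (uint e = 0; e < 8; ++e)
        acc ^= in[base + e * 4 + c];       // stride-4 per WI, stride-1 across the team

    out[team * 4 + c] = acc;               // placeholder reduction/output
}
```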
As expected, the instruction count per WI decreased; even counting the whole team, there are often savings of more than 15%, with a few notable exceptions.
To start with, VFetchInst (as reported by CodeXL) remained the same in function SW, which means I'm now issuing 4x the number of fetches. In a companion function IR, the per-WI fetch count decreased slightly, which still means it increased considerably per team (about 330%).
I am now thinking about an LDS-assisted load: loading, for example, a float4 per WI and then dispatching the components to the correct WIs in the team. Still, I was pretty surprised the compiler couldn't extract the information I was trying to express.
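This is roughly what I have in mind, sketched under the same assumed layout as above (again, names and the reduction are hypothetical): each WI issues contiguous uint4 loads that are consecutive across the team, parks them in LDS, and then reads back only its own channel's stride-4 elements from LDS.

```c
// Sketch only: LDS-staged variant of the hypothetical kernel above.
__kernel void sw_lds_sketch(__global const uint4 *in, __global uint *out)
{
    __local uint stage[16 * 32];           // 32 uints per team, 16 teams per group
    const uint c    = get_local_id(0);     // channel within the team, 0..3
    const uint t    = get_local_id(1);     // team slot within the group, 0..15
    const uint team = get_global_id(1);

    // Team-cooperative load: the 8 uint4s covering the team's 32 uints,
    // 2 per WI, consecutive across lanes, so the global reads coalesce.
    __local uint4 *row = (__local uint4 *)&stage[t * 32];
    row[c]     = in[team * 8 + c];
    row[c + 4] = in[team * 8 + c + 4];

    // The 4x16 group is exactly one wavefront, so no barrier is needed,
    // but a fence keeps the compiler from reordering the LDS accesses.
    mem_fence(CLK_LOCAL_MEM_FENCE);

    uint acc = 0;
    for (uint e = 0; e < 8; ++e)           // now a cheap LDS read per element
        acc ^= stage[t * 32 + e * 4 + c];  // placeholder reduction

    out[team * 4 + c] = acc;
}
```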
Most importantly, performance turned out to be worse than the original, yet bandwidth is still not saturated.
So, now the questions:
- What is a fetch instruction exactly? From my admittedly incomplete understanding, I speculate that fetches are really always 4D. This somehow makes sense in my head as I'm running a 128-bit card; it would make sense to me for a fetch to be a reservation of the whole memory subsystem.
- Is this expected to happen? I really tried to be careful with those pointers; I tried to make sure the compiler would understand that the WIs in the team were loading consecutive 32-bit elements, i.e. "stride one", noted as optimal in the AMD APP documentation.
- What I am doing is the opposite of vectorization: I'm trying to extract more parallelism so each of the operations can go to a different ALU in the same SIMD and happen in a single clock instead of 4. Is this approach correct? Is it good practice?
- Considering that performance seems to track the fetch count, why are those instructions so expensive? On second thought, I admit they aren't individually expensive, as they seem to get dispatched like standard instructions; rather, the sheer number of instructions generated makes them "expensive" from a 10,000 ft view.
- How often should I consider LDS data sharing to work around this issue? It seems odd to me that the LDS setup and data copy could still be faster than just coalescing the loads in the first place.
- Do you have suggestions to help the compiler understand what I'm doing?
Other suggestions, even if not directly related to the problem, are welcome.