Could somebody help me optimize my kernel for GCN (HD 7970)? I just need to apply a byte-wise operation (let it be XOR, for simplicity) to two matrices (2048x2048 = 4194304 bytes each) and get a result matrix of the same size.
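To pin down what the kernel has to compute, here is a minimal host-side reference in plain C. It is only a sketch for checking GPU output, not the kernel itself; the names `xor_bytes_ref` and `first_mismatch` are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side reference: out[i] = a[i] ^ b[i] for n bytes. */
static void xor_bytes_ref(const uint8_t *a, const uint8_t *b,
                          uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = (uint8_t)(a[i] ^ b[i]);
    }
}

/* Index of the first byte where the two buffers differ, or n if they match. */
static size_t first_mismatch(const uint8_t *expected, const uint8_t *actual,
                             size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (expected[i] != actual[i]) {
            return i;
        }
    }
    return n;
}
```

For the matrices here, `n` would be 2048 * 2048, and `first_mismatch` would compare the buffer read back from the device against the reference result.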
So I decided to use a work-group size of 256 x 1 to keep all stream cores of each CU busy. By the way, I don't see any other options here: 256 is the upper limit, and any smaller size (128, 64) would leave some stream cores idle, right?
As I understand from the APP Guide, for the CUs' memory accesses to proceed simultaneously, each successive work-group (CU) should access the "next bank and next channel".
So it means that the most suitable kernel is one in which each work-group processes this many bytes:
256 * 9 = 2304 bytes (the 3 channel bits give a factor of 8, which moves to the next bank, plus one more step to make it less likely that the same channel is occupied), right?
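The reasoning behind the 2304-byte stride can be written out as a small C sketch. The address layout here is an assumption taken from the description above (256-byte channel interleave, 3 channel bits, bank field directly above the channel bits); it is not verified for the HD 7970, and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout (not verified for the HD 7970): 256-byte channel
   interleave, 3 channel bits, bank field directly above them. */
enum { INTERLEAVE_BYTES = 256, CHANNEL_BITS = 3 };

/* Channel selected by a byte address under the assumed layout. */
static unsigned channel_of(uint64_t addr) {
    return (unsigned)((addr / INTERLEAVE_BYTES) & ((1u << CHANNEL_BITS) - 1u));
}

/* Value of the address bits above the channel field: it increments every
   8 * 256 = 2048 bytes. It is not reduced modulo the number of banks,
   because no bank count is assumed here. */
static uint64_t bank_step_of(uint64_t addr) {
    return addr / ((uint64_t)INTERLEAVE_BYTES << CHANNEL_BITS);
}

/* First byte touched by work-group g when each group covers 2304 bytes. */
static uint64_t group_base(uint64_t g) {
    return g * 2304u;
}
```

Under these assumptions, going from work-group 0 to work-group 1 advances both values by one, which is the "next bank and next channel" pattern. If the real mapping differs, the stride would have to be recomputed.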
Because of this, each work-item should handle 2304 bytes / 256 work-items = 9 adjacent bytes, read and written at once (how do I do that, by the way?). As a result, NDRange = 256 x 1821 (that is, 1821 work-groups), with 1280 extra bytes processed.
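The sizing arithmetic can be checked with a few lines of C. This is only a sketch of the numbers quoted above; `groups_needed` and `padding_bytes` are hypothetical helper names, and the constants are the 256 work-items and 9 bytes per work-item from the question.

```c
#include <stddef.h>

enum { WORK_GROUP_SIZE = 256, BYTES_PER_ITEM = 9 };

/* Bytes covered by one work-group: 256 * 9 = 2304. */
static size_t bytes_per_group(void) {
    return (size_t)WORK_GROUP_SIZE * BYTES_PER_ITEM;
}

/* Number of work-groups needed to cover total_bytes, rounding up. */
static size_t groups_needed(size_t total_bytes) {
    return (total_bytes + bytes_per_group() - 1) / bytes_per_group();
}

/* Bytes processed beyond the end of the real data. */
static size_t padding_bytes(size_t total_bytes) {
    return groups_needed(total_bytes) * bytes_per_group() - total_bytes;
}
```

For 2048 * 2048 bytes this gives 1821 work-groups and 1280 bytes of padding. Those extra bytes have to be backed by something: either the buffers are allocated at the padded size, or the kernel checks its index against the real size before reading and writing.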
Am I right with all these calculations, and will the promised performance boost be worth computing the extra bytes?