
kernel memory optimization for GCN

Question asked by sandyandr on Jul 19, 2012
Latest reply on Jul 20, 2012 by sandyandr

Could somebody help me optimize my kernel for GCN (HD 7970)? I need to apply a byte-wise operation (say XOR, for simplicity) to two matrices (2048 x 2048 = 4194304 bytes each) and get a result matrix of the same size.
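To make the operation concrete, here is a minimal CPU-side reference sketch in Python. It is not the OpenCL kernel itself, and the names `SIDE`, `SIZE`, and `xor_bytes` are only illustrative; it just pins down the expected result for two flat byte buffers.

```python
SIDE = 2048
SIZE = SIDE * SIDE  # 4194304 bytes per matrix, stored as a flat buffer


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Return the element-wise XOR of two equal-length byte buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    # Each output byte depends only on the two input bytes at the same index,
    # which is what makes the problem trivially parallel on a GPU.
    return bytes(x ^ y for x, y in zip(a, b))


if __name__ == "__main__":
    a = bytes([0x0F, 0xF0, 0xAA])
    b = bytes([0xFF, 0xFF, 0x55])
    print(xor_bytes(a, b).hex())
```

A reference like this can be used to check a GPU kernel's output on small inputs before tuning the memory access pattern.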

I decided to use a work-group size of 256 x 1 to keep all stream cores of each CU busy. I don't see any other option here: 256 is the upper limit, and smaller sizes (128, 64) would leave some stream cores idle, right?

As I understand from the APP Guide, for the CUs' memory accesses to happen simultaneously, each successive work-group (CU) should access the "next bank and next channel".

This means the work-item code should be written so that each work-group processes this many bytes:

256 * 9 = 2304 bytes (3 channel bits to move to the next bank, which gives 8, plus one to make hitting the same channel less likely), right?

Because of this, each work-item should handle 2304 bytes / 256 work-items = 9 adjacent bytes, read and written at once (how do I do that, by the way?). As a result, NDRange = 256x1821, with 1280 extra bytes processed.
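The arithmetic above can be checked with a short Python sketch. The constant names and the `byte_range` helper are hypothetical, and the layout assumed here (work-item N owns bytes 9N to 9N+8 of the flat buffer) is just one reading of "9 adjacent bytes"; it says nothing about whether this stride is actually the best choice for the hardware.

```python
TOTAL_BYTES = 2048 * 2048        # 4194304 bytes per matrix
WORK_ITEMS_PER_GROUP = 256
BYTES_PER_WORK_ITEM = 9

# Bytes covered by one work-group: 256 * 9 = 2304
bytes_per_group = WORK_ITEMS_PER_GROUP * BYTES_PER_WORK_ITEM

# Work-groups needed to cover the matrix, rounded up: 1821
num_groups = -(-TOTAL_BYTES // bytes_per_group)

# Bytes past the end of the matrix that the last group would touch: 1280
extra_bytes = num_groups * bytes_per_group - TOTAL_BYTES


def byte_range(group_id: int, local_id: int) -> range:
    """Byte indices owned by one work-item, clipped to the matrix size.

    Assumes work-item N owns the 9 consecutive bytes starting at 9 * N.
    """
    global_id = group_id * WORK_ITEMS_PER_GROUP + local_id
    start = global_id * BYTES_PER_WORK_ITEM
    stop = min(start + BYTES_PER_WORK_ITEM, TOTAL_BYTES)
    return range(start, stop)  # empty when the work-item is entirely out of range


if __name__ == "__main__":
    print(bytes_per_group, num_groups, extra_bytes)
```

The clipping in `byte_range` shows the alternative to processing the 1280 extra bytes: padding the buffers to 1821 * 2304 bytes avoids a bounds check in the kernel, while clipping avoids the padding.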

Are all these calculations right, and will the promised performance boost be worth computing the extra bytes?