Archives Discussions

sandyandr · ‎07-19-2012

Could somebody help me optimize my kernel for GCN (HD 7970)? I just need to do a byte-wise operation (say XOR, for simplicity) on two matrices (2048x2048 = 4194304 bytes each) and get a result matrix of the same size.

So I decided to make the group size 256 x 1 to keep all stream cores of each CU busy. By the way, I don't see any other option here: 256 is the upper limit, and any smaller size (128, 64) would leave some stream cores idle, right?

As I understand from the APP Guide, for the CUs to access memory simultaneously, each successive work-group (CU) should access the "next bank and next channel".

So the work-item kernel code should be written so that each work-group processes this many bytes:

256 * 9 = 2304 bytes (9 = 8 + 1: three channel bits give 8, which moves to the next bank, plus one to make landing on the same channel less likely), right?

Because of that, each work-item should handle 2304 bytes / 256 work-items = 9 adjacent bytes, read and written at once (how do I do that, by the way?). As a result, NDRange = 256 x 1821, with 1280 extra bytes processed.
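The sizing arithmetic above can be checked with a short sketch. It is plain Python, not OpenCL, and the constant names are only illustrative; the values (256 work-items, 9 bytes per item, 2048x2048 bytes) are the ones given in the post.

```python
# Sketch: check the NDRange sizing from the post (names are illustrative).
MATRIX_BYTES = 2048 * 2048   # one 2048x2048 byte matrix
GROUP_SIZE = 256             # work-items per work-group
BYTES_PER_ITEM = 9           # per-item span proposed in the post

bytes_per_group = GROUP_SIZE * BYTES_PER_ITEM

# Ceiling division: the last group is only partly filled with real data.
num_groups = -(-MATRIX_BYTES // bytes_per_group)

# Bytes covered by the NDRange beyond the end of the matrix.
extra_bytes = num_groups * bytes_per_group - MATRIX_BYTES

print(bytes_per_group, num_groups, extra_bytes)
```

This reproduces the 2304 bytes per group, 1821 groups, and 1280 extra bytes quoted in the post.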

Are these calculations right, and will the promised performance boost be worth computing the extra bytes?
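On the extra bytes: one way to picture it is that each work-item clamps its 9-byte span to the buffer length, so the padded tail of the NDRange does nothing. The sketch below emulates that in plain Python. It is not OpenCL and says nothing about whether 9 bytes can be loaded in a single access on the hardware; `xor_work_item` and `xor_buffers` are hypothetical names, and the 100-byte buffers are just test data.

```python
BYTES_PER_ITEM = 9   # per-work-item span proposed in the post
GROUP_SIZE = 256     # work-items per work-group

def xor_work_item(gid, a, b, out, n):
    """Emulate one work-item: XOR up to 9 adjacent bytes, clamped to n."""
    start = gid * BYTES_PER_ITEM
    end = min(start + BYTES_PER_ITEM, n)
    for i in range(start, end):   # empty range when start >= n
        out[i] = a[i] ^ b[i]

def xor_buffers(a, b):
    """Emulate the whole padded NDRange, one work-item at a time."""
    n = len(a)
    out = bytearray(n)
    num_items = -(-n // BYTES_PER_ITEM)                      # items needed
    padded_items = -(-num_items // GROUP_SIZE) * GROUP_SIZE  # whole groups
    for gid in range(padded_items):
        xor_work_item(gid, a, b, out, n)
    return bytes(out)

# Small stand-in buffers; 100 is deliberately not a multiple of 9.
a = bytes(range(100))
b = bytes((7 * i) % 256 for i in range(100))
result = xor_buffers(a, b)
```

Work-items whose span starts past the end of the buffer simply fall through, so the padding costs idle work-items rather than out-of-bounds accesses.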

binying · ‎07-19-2012

"256 is the high limit but all other numbers (128, 64) make some stream cores idle - right?"

No, the other sizes don't leave any stream cores idle. So you may try another work-group size to minimize the extra bytes processed.

sandyandr · ‎07-19-2012

I still don't understand it clearly. If, for instance, the work-group size is 128 and the wavefront size is 64, I will have only two wavefronts per CU, but there are four stream cores, right? What instructions will be executing on the other two? Please clarify, this really puzzles me!

nou · ‎07-20-2012

another work group.
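To spell that answer out with numbers, here is a rough sketch in plain Python. The wavefront size of 64 comes from the thread; `SIMDS_PER_CU = 4` is an assumption standing in for the "four" execution units per CU mentioned in the question, and the function names are made up for illustration.

```python
WAVEFRONT_SIZE = 64   # work-items per wavefront, as stated in the thread
SIMDS_PER_CU = 4      # assumption: the "four" units per CU from the question

def wavefronts_per_group(group_size):
    """Number of wavefronts a work-group of this size is split into."""
    return -(-group_size // WAVEFRONT_SIZE)   # ceiling division

def groups_to_cover_cu(group_size):
    """Work-groups needed so that every unit has at least one wavefront."""
    return -(-SIMDS_PER_CU // wavefronts_per_group(group_size))

for size in (64, 128, 256):
    print(size, wavefronts_per_group(size), groups_to_cover_cu(size))
```

With 128 work-items per group, each group supplies two wavefronts, so two groups resident on the same CU are enough to give all four units work.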

sandyandr · ‎07-20-2012

Thank you. I have just found it in the APP Guide. But can I be sure that smaller work-groups (128 items), with the same width (9 bytes) per item, will be assigned to the CUs in strict order (even together with the next odd one), so that all CUs keep accessing different memory banks and channels at the same time? As far as I know, I can't. So reducing the work-group size doesn't seem helpful here.

kernel memory optimization for GCN