tugrul_512bit
Adept III

GCN: 4 workitems per compute unit but with computing float16 in each.

Normally I launch 256 work-items per workgroup, and the 64 cores of a compute unit process them. What if I instead launch just 4 work-items, each doing float16 math? Does the compiler map each work-item onto a whole 16-wide SIMD, in any driver or GCN version?


Accepted Solutions
dipak
Staff

Re: GCN: 4 workitems per compute unit but with computing float16 in each.

On GCN, the processing elements (vector lanes) of each SIMD are effectively scalar. I don't think vectorization has much effect on GCN devices compared with the earlier VLIW architectures. In fact, vectorization may degrade performance, as mentioned in the optimization guide:

"Notes" under section "Specific Guidelines for GCN family GPUs"

  • Vectorization is no longer needed, nor desirable; in fact, it can affect performance because it requires a greater number of VGPRs for storage. It is recommended not to combine work-items.
  • Read coalescing does not work for 64-bit data sizes. This means reads for float2, int2, and double might be slower than expected.
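To put rough numbers on the VGPR point, here is a minimal back-of-the-envelope sketch in Python. It assumes the commonly documented GCN figures (64 work-items per wavefront, 256 VGPRs available per lane on each SIMD, at most 10 wavefronts per SIMD) and ignores register allocation granularity. The function names and the "8 live values" figure are hypothetical, chosen only for illustration; real register usage depends on the compiler.

```python
# Rough GCN occupancy arithmetic. The constants are the commonly documented
# GCN values; treat them as assumptions and check the guide for your device.
WAVEFRONT_SIZE = 64        # work-items per wavefront
VGPR_BUDGET = 256          # VGPRs available per lane on each SIMD
MAX_WAVES_PER_SIMD = 10    # hardware cap on resident wavefronts per SIMD


def lane_utilization(work_items_per_group):
    """Fraction of wavefront lanes doing useful work for one workgroup."""
    # Each work-item occupies one lane; partial wavefronts leave lanes idle.
    waves = -(-work_items_per_group // WAVEFRONT_SIZE)  # ceiling division
    return work_items_per_group / (waves * WAVEFRONT_SIZE)


def vgprs_for(live_values, vector_width):
    """Hypothetical estimate: one VGPR per 32-bit component kept live."""
    return live_values * vector_width


def waves_per_simd(vgprs_per_work_item):
    """Resident wavefronts per SIMD when limited only by VGPR usage."""
    return min(MAX_WAVES_PER_SIMD, VGPR_BUDGET // vgprs_per_work_item)


if __name__ == "__main__":
    # 256 scalar work-items versus 4 work-items in one workgroup.
    print("lane utilization, 256 items:", lane_utilization(256))
    print("lane utilization,   4 items:", lane_utilization(4))
    # Eight live values held as float versus as float16.
    print("waves/SIMD, float:  ", waves_per_simd(vgprs_for(8, 1)))
    print("waves/SIMD, float16:", waves_per_simd(vgprs_for(8, 16)))
```

Under these assumptions, a workgroup of 4 work-items still occupies a full 64-lane wavefront with only 4 lanes active, and the float16 components of each work-item are processed one after another in that work-item's lane rather than being spread across the 16 lanes of a SIMD. The wider type also multiplies the registers each work-item needs, which reduces the number of wavefronts that can be resident to hide latency.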

Regards,

2 Replies
tugrul_512bit
Adept III

Re: GCN: 4 workitems per compute unit but with computing float16 in each.

Thank you
