Hi, I have a question about best practices for local memory accesses. Section 6.2 of the AMD OpenCL programming guide (v2.4), on page 6-17 reads:
"A simple sequential address pattern, where each work-item reads a float2 value from LDS, generates a conflict-free access pattern on the ATI Radeon™ HD 5870 GPU. Note that a sequential access pattern, where each work-item reads a float4 value from LDS, uses only half the banks on each cycle on the ATI Radeon™ HD 5870 GPU and delivers half the performance of the float access pattern."
The first sentence makes sense to me. I have two questions about the second:
1. Am I correct in assuming that the second sentence refers back to the first and should read "...delivers only half the performance of the float2 access pattern"?
2. I understand that a quarter wavefront accessing float4 values will generate bank conflicts. However, (if my above assumption is correct) how does this deliver only half the performance of a float2 access pattern?
For example, suppose each work-item ultimately needs to access 4 float values. If each reads a float4 straight up, bank conflicts will occur and we'll need 2 cycles (?) to service a quarter wavefront of 16 work-items. However, if each work-item reads a float2, we'll need a loop that iterates twice, each iteration reading one float2 (requiring 1 cycle), yielding the same (?) total access time of 2 cycles. I'm probably missing something important about how bank conflicts are resolved...
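To make the comparison concrete, here is a small sketch of the bank arithmetic as I currently picture it. It rests on a simplified model that is my own assumption, not something the quoted passage states: 32 LDS banks, each one dword (4 bytes) wide, bank index = dword address mod 32, and a quarter wavefront of 16 work-items each touching two consecutive dwords at a time. The function name `bank_usage` and its parameters are hypothetical, just for illustration.

```python
# Simplified LDS bank model (assumptions, not confirmed hardware behavior):
NUM_BANKS = 32    # assumed number of LDS banks, each one dword wide
WORK_ITEMS = 16   # quarter wavefront, as in the question above


def bank_usage(stride_dwords, offset_dwords, width_dwords=2):
    """Return (distinct banks touched, most accesses landing on one bank).

    Work-item wi touches `width_dwords` consecutive dwords starting at
    dword address wi * stride_dwords + offset_dwords.
    """
    hits = [0] * NUM_BANKS
    for wi in range(WORK_ITEMS):
        base = wi * stride_dwords + offset_dwords
        for k in range(width_dwords):
            hits[(base + k) % NUM_BANKS] += 1
    banks_used = sum(1 for h in hits if h > 0)
    return banks_used, max(hits)


# Sequential float2: work-item wi reads dwords 2*wi and 2*wi + 1.
float2_seq = bank_usage(stride_dwords=2, offset_dwords=0)

# Sequential float4, viewed as two two-dword halves (.xy and .zw);
# treating a float4 read as two such halves is an assumption.
float4_lo = bank_usage(stride_dwords=4, offset_dwords=0)
float4_hi = bank_usage(stride_dwords=4, offset_dwords=2)

print(float2_seq)  # (32, 1): all banks used, no bank hit twice
print(float4_lo)   # (16, 2): half the banks used, each hit twice
print(float4_hi)   # (16, 2)
```

In this model the float2 pattern touches all 32 banks exactly once, while each half of the float4 pattern touches only 16 banks, twice each, which at least matches the guide's "uses only half the banks" wording. How that translates into cycles depends on how a float4 read is actually issued, which is the part I'm unsure about.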
By the way AMD, your documentation is great, and has been immensely helpful thus far. Thanks!