Hi, I have a question about best practices for local memory accesses. Section 6.2 of the AMD OpenCL programming guide (v2.4), on page 6-17 reads:
"A simple sequential address pattern, where each work-item reads a float2 value from LDS, generates a conflict-free access pattern on the ATI Radeon™ HD 5870 GPU. Note that a sequential access pattern, where each work-item reads a float4 value from LDS, uses only half the banks on each cycle on the ATI Radeon™ HD 5870 GPU and delivers half the performance of the float access pattern."
The first sentence makes sense to me. I have two questions about the second:
1. Am I correct in assuming that the second sentence refers back to the first and should read "...delivers only half the performance of the float2 access pattern"?
2. I understand that a quarter wavefront accessing float4 values will generate bank conflicts. However, (if my above assumption is correct) how does this deliver only half the performance of a float2 access pattern?
For example, suppose each work-item ultimately needs to access 4 float values. If each reads a float4 directly, bank conflicts will occur and we'll need 2 cycles (?) to service a quarter wavefront of 16 work-items. However, if each work-item reads a float2, we'll need a loop that iterates twice, each iteration reading one float2 (requiring 1 cycle), yielding the same (?) total access time of 2 cycles. I'm probably missing something important about how bank conflicts are resolved...
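To make the arithmetic in my example concrete, here is a minimal sketch of the model I have in mind. It assumes 32 LDS banks, each 4 bytes wide, with the bank index taken as (byte address / 4) mod 32, and it counts how many 4-byte words from one quarter wavefront land on the busiest bank. The constants and the function name `max_bank_load` are my own illustration, not anything from the guide, and the model says nothing about how the hardware actually schedules multi-word reads, which may well be the part I'm missing.

```python
from collections import Counter

NUM_BANKS = 32          # assumed number of LDS banks
BANK_WIDTH = 4          # assumed bank width in bytes (one float)
QUARTER_WAVEFRONT = 16  # work-items considered together in my example


def max_bank_load(floats_per_item, work_items=QUARTER_WAVEFRONT):
    """Return the largest number of 4-byte words mapped to any single bank
    when each work-item reads `floats_per_item` consecutive floats from a
    sequential (tightly packed) layout."""
    hits = Counter()
    for wi in range(work_items):
        base = wi * floats_per_item * BANK_WIDTH  # byte address of this item's data
        for k in range(floats_per_item):
            addr = base + k * BANK_WIDTH
            bank = (addr // BANK_WIDTH) % NUM_BANKS
            hits[bank] += 1
    return max(hits.values())


if __name__ == "__main__":
    for n, name in [(1, "float"), (2, "float2"), (4, "float4")]:
        print(name, max_bank_load(n))
```

Under these assumptions the float and float2 patterns put at most one word on each bank, while the float4 pattern puts two words on every bank. That is where my estimate of 2 cycles per quarter wavefront for float4 comes from, and why two float2 passes look equivalent to me.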
By the way AMD, your documentation is great, and has been immensely helpful thus far. Thanks!
Somewhat embarrassingly, shortly after posting I located another thread that explains this exact scenario. If anyone passes through here, have a look at LeeHowes' response in this thread.