Archives Discussions

boxerab · ‎08-05-2014

What is the best way to lay this out in local memory to reduce bank conflicts?

I was thinking:

RRRRRRRRRRRR... GGGGGGGGGGGG... BBBBBBBBBBBB... AAAAAAAAAAAA...

I would like to grab all four channels at once to use in vector operations.

Thanks!

dipak · ‎09-22-2014

Hi,

Since you want all four channels at once, a four-component vector type would be the natural choice. To get the best performance, though, you need to consider two main points before choosing the size of the vector type: 1) the maximum number of LDS bytes that each stream core can request per cycle (the actual number depends on the GPU architecture), and 2) the number of memory bank requests the LDS can serve per cycle. For example, the following section of the APP Programming Guide says:

6. OpenCL Performance and Optimization for GCN Devices->6.2 Local Memory (LDS) Optimization

A simple sequential address pattern, where each work-item reads a float2 value from LDS, generates a conflict-free access pattern on the AMD Radeon HD 7XXX GPU. Note that a sequential access pattern, where each work-item reads a float4 value from LDS, uses only half the banks on each cycle on the AMD Radeon HD 7XXX GPU and delivers half the performance of the float access pattern.

Now, suppose you store each color channel as a float or int. Reading all four channels of a pixel at once is then a float4 or int4 read from LDS, so it will show the same problem, and performance will accordingly be half that of the float access pattern.

Regards,

Best memory layout for RGBA data in local memory?