
AM_902
Journeyman III

Local Memory vs. Registers Only

Performance comparison between local-memory-based kernels and purely register-based kernels

Hi,

Recently I implemented a 1D FFT using OpenCL on a Radeon HD5870. My first attempt was a kernel that uses only registers for its computations. We use a multistage strategy whereby each kernel reads the entire array, computes multiple FFT stages, and writes the output back to global memory. As an optimization step I then explored the use of local memory, closely following the approach used by the IPT ATI FFT implementation. What I have noticed is that there is no performance advantage from using local memory in this particular case. Even the benchmarks with IPT ATI reveal almost the same performance for all input sizes.
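To make the comparison concrete, the register-only version looks roughly like the simplified sketch below (illustrative only, one radix-2 stage shown; my actual kernels fuse several stages and handle bit-reversal, which is omitted here):

__kernel void fft_stage_registers(__global float2 *data,
                                  const uint stride)   // distance between butterfly partners
{
    uint gid  = get_global_id(0);
    uint base = (gid / stride) * (stride * 2) + (gid % stride);

    // Both butterfly operands live in registers (private memory).
    float2 a = data[base];
    float2 b = data[base + stride];

    // Twiddle factor W = exp(-2*pi*i*k / (2*stride)) for this butterfly.
    float  angle = -2.0f * M_PI_F * (float)(gid % stride) / (float)(stride * 2);
    float2 w     = (float2)(native_cos(angle), native_sin(angle));

    // Complex multiply b*w, then the radix-2 butterfly, all done in registers.
    float2 bw = (float2)(b.x * w.x - b.y * w.y, b.x * w.y + b.y * w.x);
    data[base]          = a + bw;
    data[base + stride] = a - bw;
}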

From what I know, the bandwidth of register memory is 13,056 GB/s, compared to 2,176 GB/s for local memory (Ref: Appendix D, ATI Stream SDK OpenCL Programming Guide, June 2010), which seems to neutralize any performance gain from local memory. It seems that for a kernel with small enough register usage, one can get the best performance from register memory alone.
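For contrast, the local-memory version stages a block of the signal in LDS so that consecutive stages exchange data through LDS (with barriers) instead of going back to global memory. A rough sketch of that pattern (again illustrative only, ignoring bit-reversal and higher radices; this is not the actual IPT ATI code):

__kernel void fft_block_lds(__global float2 *data,
                            __local  float2 *block)   // 2 * local_size elements
{
    uint lsz = get_local_size(0);
    uint lid = get_local_id(0);
    uint n   = 2 * lsz;                               // points handled by this workgroup
    uint off = get_group_id(0) * n;

    // Load the block from global memory into LDS once.
    block[lid]       = data[off + lid];
    block[lid + lsz] = data[off + lid + lsz];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Run the radix-2 stages entirely out of LDS.
    for (uint stride = 1; stride < n; stride *= 2) {
        uint base = (lid / stride) * (stride * 2) + (lid % stride);

        float2 a = block[base];
        float2 b = block[base + stride];

        float  angle = -2.0f * M_PI_F * (float)(lid % stride) / (float)(stride * 2);
        float2 w     = (float2)(native_cos(angle), native_sin(angle));
        float2 bw    = (float2)(b.x * w.x - b.y * w.y, b.x * w.y + b.y * w.x);

        block[base]          = a + bw;
        block[base + stride] = a - bw;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the transformed block back to global memory.
    data[off + lid]       = block[lid];
    data[off + lid + lsz] = block[lid + lsz];
}

In my measurements the extra barriers and LDS traffic appear to cancel out whatever is saved on global memory, which would explain why the two versions run at practically the same speed.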

Any and all thoughts/suggestions on this would be very interesting and helpful.

5 Replies
himanshu_gautam
Grandmaster

AM_902,

You are correct. Registers have the lowest latency and hence the maximum bandwidth. LDS is more useful only if the data needs to be shared between workgroups or if the memory requirements are huge.

One thing to keep in mind is that it is our responsibility to make enough wavefronts available to each compute unit to hide latencies effectively. The number of wavefronts a compute unit can support depends on the private memory requirements of the kernel.
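As a rough rule of thumb (illustrative formula only; the exact limits are in the programming guide):

wavefronts_per_CU ≈ registers_per_CU / (wavefront_size * registers_per_work_item)

So a kernel that doubles its per-work-item register usage roughly halves the number of wavefronts the compute unit can keep in flight to hide latency, up to the hardware's per-CU wavefront cap.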


himanshu.gautam

I think you meant "data needs to be shared within a workgroup".

Regards.


Thanks DTop for catching the mistake.

Yes, LDS is useful when data needs to be shared between the various work-items of the same workgroup.

cjang
Journeyman III

> Even the benchmarks with IPT ATI reveal almost the same performance for all input sizes.

This suggests execution time is dominated by overheads. The FFT takes O(N log N) arithmetic operations, so the ratio of operations to memory accesses, the arithmetic intensity, is only O(log N). As the problem size scales, the kernel execution overhead cannot be amortized unless N is very large. The overhead costs remain dominant, so many optimizations that would otherwise be effective have no noticeable effect.
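As a back-of-the-envelope example (rough, illustrative numbers): a 1D FFT of N = 4096 complex floats is about 5 * N * log2(N) ≈ 250K flops over 32 KB of data. At the HD5870's ~2.7 TFLOP/s peak that is on the order of 0.1 microseconds of arithmetic, and even the global memory traffic is well under a microsecond at ~150 GB/s, while a kernel launch typically costs tens of microseconds. So until N is very large you are mostly measuring launch overhead, no matter how the kernel handles its memory.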

homeskyshop
Journeyman III

I agree with you, but we should take care that enough wavefronts are available to each compute unit to hide latencies; then we can easily share data within workgroups.

