Recently I implemented a 1D FFT using OpenCL on a Radeon HD5870. My first attempt was a kernel that uses only registers for its computations. We use a multistage strategy in which each kernel reads the entire array, computes multiple FFT stages, and writes the output back to global memory. Subsequently, as an optimization step, I tried using local memory, closely following the approach used by the IPT ATI FFT implementation. What I have noticed is that local memory gives no performance advantage in this particular case. Even the benches with IPT ATI reveal almost the same performance for all input sizes.
As far as I know, the peak bandwidth of the register file is 13056 GB/sec, in contrast to 2176 GB/sec for local memory (Ref: Appendix D, ATI Stream SDK OpenCL Programming Guide, June 2010), which seems to neutralize any performance gain from local memory. It seems that a kernel with small enough register usage gets its best performance from registers alone.
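To put the two quoted figures side by side, here is a minimal sketch in Python. It uses only the peak bandwidths cited above. The helper name and the idea of a best-case transfer time are illustrative assumptions, not measurements from the HD5870.

```python
# Peak bandwidth figures quoted above (GB/s), as cited from the
# ATI Stream SDK OpenCL Programming Guide, Appendix D.
REGISTER_BW_GBPS = 13056
LOCAL_BW_GBPS = 2176


def min_transfer_time_us(num_bytes, bandwidth_gbps):
    """Best-case time in microseconds to move num_bytes at peak bandwidth.

    Hypothetical helper: real kernels rarely reach peak bandwidth.
    """
    return num_bytes / (bandwidth_gbps * 1e9) * 1e6


# Ratio of register bandwidth to local memory bandwidth.
bandwidth_ratio = REGISTER_BW_GBPS / LOCAL_BW_GBPS

if __name__ == "__main__":
    print("register / local bandwidth ratio:", bandwidth_ratio)
    one_mib = 1 << 20
    print("1 MiB via registers (us):", min_transfer_time_us(one_mib, REGISTER_BW_GBPS))
    print("1 MiB via local memory (us):", min_transfer_time_us(one_mib, LOCAL_BW_GBPS))
```

By these peak figures, data that can stay in registers moves about six times faster than data staged through local memory.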
Any and all thoughts/suggestions on this would be very interesting and helpful.
You are correct. Registers have the lowest latency and the highest bandwidth. LDS is useful mainly when data needs to be shared between work-items in the same work-group (it is not visible across work-groups), or when the working set is too large to keep in registers.
One thing to keep in mind is that it is our responsibility to make enough wavefronts available to each compute unit to hide latencies effectively. The number of wavefronts a compute unit can support depends on the private memory (register) requirements of the kernel.
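A rough way to see how register usage limits the wavefront count is sketched below in Python. The defaults (16384 vector registers per compute unit, 64 work-items per wavefront) are assumptions about the HD5870 and should be checked against the programming guide. The function name is hypothetical, and the model ignores other limits such as LDS usage or any hardware cap on resident wavefronts.

```python
def wavefronts_by_registers(gprs_per_work_item,
                            gprs_per_compute_unit=16384,  # assumed register file size
                            wavefront_size=64):           # assumed work-items per wavefront
    """Estimate how many wavefronts fit in one compute unit's register file.

    Only the register constraint is modeled; the real limit may be lower.
    """
    if gprs_per_work_item <= 0:
        raise ValueError("gprs_per_work_item must be positive")
    # Each resident wavefront needs wavefront_size * gprs_per_work_item registers.
    return gprs_per_compute_unit // (wavefront_size * gprs_per_work_item)


if __name__ == "__main__":
    for gprs in (8, 16, 32, 64, 100):
        print(gprs, "GPRs per work-item ->", wavefronts_by_registers(gprs), "wavefronts")
```

Under these assumptions, a register-only FFT kernel that needs many registers per work-item leaves fewer wavefronts available to hide latency, which is the trade-off to weigh against the bandwidth advantage of registers.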
> Even the benches with IPT ATI reveal almost the same performance for all input sizes.
This suggests execution time is dominated by overheads. The FFT performs O(N * log(N)) arithmetic operations on O(N) data, so the ratio of operations to memory accesses (the arithmetic intensity) is O(log(N)). Because this ratio grows so slowly with problem size, the kernel execution overhead cannot be amortized unless N is very large. While the overhead costs dominate, many otherwise effective optimizations have no noticeable effect.
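The argument can be made concrete with a small model, sketched here in Python. It assumes the conventional 5 * N * log2(N) flop count for a complex FFT and single-precision complex data (8 bytes per element) read once and written once. The launch overhead and sustained throughput in the usage loop are made-up illustrative values, not measurements.

```python
import math


def fft_flops(n):
    """Conventional flop estimate for a complex FFT of size n (assumed model)."""
    return 5 * n * math.log2(n)


def fft_min_bytes(n, bytes_per_element=8):
    """Minimum global memory traffic: read the array once and write it once."""
    return 2 * n * bytes_per_element


def arithmetic_intensity(n):
    """Flops per byte of global memory traffic; grows only as log2(n)."""
    return fft_flops(n) / fft_min_bytes(n)


def overhead_fraction(n, launch_overhead_s, sustained_flops):
    """Share of total time spent in fixed overhead under a simple additive model."""
    compute_s = fft_flops(n) / sustained_flops
    return launch_overhead_s / (launch_overhead_s + compute_s)


if __name__ == "__main__":
    # Illustrative values only: 20 us overhead, 100 GFLOP/s sustained.
    for exp in (10, 14, 18, 22):
        n = 1 << exp
        print(n, arithmetic_intensity(n), overhead_fraction(n, 20e-6, 100e9))
```

In this model the overhead share falls only once N becomes large, which is consistent with two different implementations showing almost the same timings at small and medium sizes.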