I recently implemented a 1D FFT using OpenCL on a Radeon HD5870. My first attempt was a kernel that does all of its computation in registers. It uses a multistage strategy: each kernel reads the entire array from global memory, computes several FFT stages, and writes the output back to global memory. As an optimization step, I then tried using local memory, closely following the approach used by the IPT ATI FFT implementation. What I noticed is that local memory gives no performance advantage in this particular case. Even benchmarks against IPT ATI show almost the same performance for all input sizes.
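To make the cost of the multistage strategy concrete, here is a rough back-of-the-envelope sketch in Python. It assumes radix-2 stages, single-precision complex data (8 bytes per point), a power-of-two length, and that every kernel launch reads and writes the whole array exactly once. The function names and the 2^20 input size are only illustrative and are not taken from the actual kernels.

```python
import math

def global_passes(n, stages_per_kernel):
    """Kernel launches needed for an n-point FFT (n a power of two)
    when each launch computes `stages_per_kernel` radix-2 stages."""
    total_stages = n.bit_length() - 1  # log2(n) for a power of two
    return math.ceil(total_stages / stages_per_kernel)

def global_traffic_bytes(n, stages_per_kernel, bytes_per_point=8):
    """Global-memory traffic, assuming each launch reads and writes
    the full array once (assumed: 8 bytes per complex float point)."""
    return global_passes(n, stages_per_kernel) * 2 * n * bytes_per_point

n = 1 << 20  # illustrative input size
for k in (1, 3, 4):
    print(k, global_passes(n, k), global_traffic_bytes(n, k))
```

Under these assumptions, the more stages a single kernel can fuse, the fewer round trips to global memory are needed, regardless of whether the fused stages exchange data through registers or local memory.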
As far as I know, the peak bandwidth of register memory is 13056 GB/sec, compared with 2176 GB/sec for local memory (Ref: Appendix D, ATI Stream SDK OpenCL Programming Guide, June 2010), and this difference seems to neutralize any performance gain from local memory. It seems that a kernel with small enough register usage can get its best performance from registers alone.
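As a quick sanity check on those two figures, the ratio can be worked out directly. This is only arithmetic on the quoted peak numbers; real kernels will usually achieve less than peak on both paths, so the ratio is an upper-level estimate rather than a measured speedup. The constant names are made up for this example.

```python
# Peak bandwidths as quoted from the programming guide (GB/sec).
REGISTER_BW_GBS = 13056.0
LOCAL_BW_GBS = 2176.0

# How many times faster the register file is than local memory at peak.
ratio = REGISTER_BW_GBS / LOCAL_BW_GBS
print(ratio)  # -> 6.0
```

So at peak rates a value kept in a register is about six times cheaper to access than the same value staged through local memory, which is consistent with local memory not helping when the working set already fits in registers.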
Any and all thoughts/suggestions on this would be very interesting and helpful.