Earlier this week I started a thread for an Avisynth plug-in filter I've called Deathray that performs de-noising on video:
OK, my first attempt at posting this thread failed miserably. So if you want to see the source code or use the software, please see the thread linked above, which is hosted on forum software that works.
Deathray is BSD licensed.
AMD staff might like to take a look at the kernels. Though they're not particularly complex, one of them, NLMMultiFrameFourPixel, uses 55 GPRs on Cypress and 56 on Cayman. I'm quite sure that's far in excess of what's needed.
I plan to make some changes, including making the inner loop use TEX instead of local memory. This reduces the GPR allocation.
The GPR allocation doesn't really impact performance, because the workgroup size is 256 and the inner loop is ~300 cycles long (iterated 49 times) with no off-die memory accesses or group barriers. But there is an impact of a couple of percent while the kernel manipulates local memory with numerous group barriers, because only 1 workgroup fits per SIMD.
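
For anyone who wants to check the occupancy arithmetic, here is a rough sketch in Python. It assumes a wavefront of 64 work-items and a register file of 16384 vec4 GPRs per SIMD (256 per lane); those two figures and the function names are my assumptions for illustration, not something taken from the Deathray source. It also ignores any other limits, such as local memory use or registers the hardware reserves, so treat the results as an upper bound.

```python
WAVEFRONT_SIZE = 64   # assumption: work-items per wavefront
REGS_PER_LANE = 256   # assumption: 16384 vec4 GPRs per SIMD / 64 lanes


def wavefronts_per_simd(gprs, regs_per_lane=REGS_PER_LANE):
    """Wavefronts that fit on one SIMD when limited by GPR use alone."""
    return regs_per_lane // gprs


def workgroups_per_simd(gprs, workgroup_size=256, regs_per_lane=REGS_PER_LANE):
    """Whole workgroups that fit on one SIMD when limited by GPR use alone."""
    # Round up: a partly filled wavefront still occupies a full wavefront slot.
    waves_per_group = -(-workgroup_size // WAVEFRONT_SIZE)
    return wavefronts_per_simd(gprs, regs_per_lane) // waves_per_group


# GPR counts quoted above for NLMMultiFrameFourPixel.
for chip, gprs in (("Cypress", 55), ("Cayman", 56)):
    print(chip, gprs, "GPRs:",
          wavefronts_per_simd(gprs), "wavefronts,",
          workgroups_per_simd(gprs), "workgroup(s) per SIMD")

# Approximate length of the barrier-free inner loop: ~300 cycles x 49 iterations.
inner_loop_cycles = 300 * 49
print("inner loop cycles:", inner_loop_cycles)
```

Under these assumptions both 55 and 56 GPRs allow 4 wavefronts, which is exactly one 256 work-item workgroup per SIMD. A second workgroup would only fit at 32 GPRs or fewer.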