I've finally gotten a compute shader working using LDS, doing matrix multiply as a test (code here). However, I'm disappointed with its performance: about 200 GFLOP/s on a 4870, which is worse than e.g. the SDK pixel shader code.
The idea was to store a block of one of the matrices in the LDS for each wavefront, to cut down on global memory access. Each wavefront computes a large subblock, again to reduce memory access. Then, all threads in the wavefront read the same element of the LDS at any given time (i.e. first every thread in the wavefront reads element 0 of thread 0, then every thread reads element 0 of thread 1, and so on). I had imagined that such an access might count as only 1 read of the LDS, rather than 64, and so be "fast" -- does anybody know whether this is the case? I'm wondering if this might be the bottleneck.
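For what it's worth, the memory-traffic saving I'm counting on can be sketched in plain Python (not IL -- the matrix and tile sizes below are made up for illustration; this just counts simulated global-memory element reads for the naive scheme versus the blocked one):

```python
def naive_global_reads(n):
    # Naive n x n matmul: each of the n*n outputs reads a full row of A
    # and a full column of B from global memory -> 2 * n^3 element reads.
    return 2 * n ** 3

def tiled_global_reads(n, t):
    # Blocked matmul with t x t tiles: each output tile is built up over
    # n/t steps, and each step stages one t x t tile of A and one of B
    # into local storage (the LDS in my scheme), which is then reused by
    # all threads in the wavefront.
    steps = (n // t) ** 3          # (n/t)^2 output tiles, n/t steps each
    return steps * 2 * t * t       # two t x t tile loads per step

n, t = 1024, 16
print(naive_global_reads(n))       # 2147483648
print(tiled_global_reads(n, t))    # 134217728, i.e. reduced by a factor of t
```

So blocking cuts global reads by the tile width -- which is why I expected the LDS version to win, and why I suspect the LDS reads themselves (if they aren't broadcast) are what's eating the gain.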
If reads aren't "broadcast", that suggests the LDS isn't necessarily well suited to use as a "cache"; have people found other good uses for it that couldn't be achieved by, say, synchronising on global memory accesses? If shared registers were addressable in IL code, they could potentially play the role of a cache -- might that be in the pipeline?