I read the Stream Computing User Guide 1.4 and have some questions. Any help would be appreciated.
1. About the memory tiling
Section 126.96.36.199 says that the ATI stream processor can use a tiled or a linear memory arrangement: “The tiled layout format has a pre-defined sequence of element blocks arranged in sequential memory addresses”. When does the tiling happen? Is it handled automatically by the driver? For a regular data stream defined in a Brook+ program, does the tiling happen between the stream read and the kernel invocation? What does “element blocks” mean? Is the memory tiled per element, or per block that matches the size of a wavefront?
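To make the question concrete, here is how I currently picture the two layouts. This is only a sketch of my understanding: the 4x4 tile size, the row-major order inside a tile, and the function names are my own assumptions, not taken from the guide.

```python
# Sketch only: tile size and ordering are assumptions, not from the guide.
TILE_W = 4  # hypothetical tile width in elements
TILE_H = 4  # hypothetical tile height in elements


def linear_offset(x, y, width):
    """Element offset in a row-major (linear) layout."""
    return y * width + x


def tiled_offset(x, y, width):
    """Element offset when each TILE_W x TILE_H block is stored contiguously.

    Assumes width is a multiple of TILE_W. Tiles are laid out row-major,
    and elements inside a tile are row-major as well.
    """
    tiles_per_row = width // TILE_W
    tile_index = (y // TILE_H) * tiles_per_row + (x // TILE_W)
    within_tile = (y % TILE_H) * TILE_W + (x % TILE_W)
    return tile_index * (TILE_W * TILE_H) + within_tile


if __name__ == "__main__":
    width = 8
    # Print the offsets of the first two rows under each layout.
    for y in range(2):
        print("linear:", [linear_offset(x, y, width) for x in range(width)])
        print("tiled: ", [tiled_offset(x, y, width) for x in range(width)])
```

In this picture, vertically adjacent elements stay close together in memory under the tiled layout, which is what I assume “element blocks” refers to.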
2. About the global buffer
Is every read from or write to the global buffer uncached? Does that mean that gather and scatter streams in Brook+ programs are always uncached?
3. About the wavefront
For R770, one SIMD core has 16 SPUs and a wavefront has 64 threads. How, then, does “a wavefront processes a single instruction over all of the threads at the same time”? Each SPU gets a quad (2x2) of threads from the wavefront, and the 4 threads in the quad are switched in a block multithreading manner, right? Does that mean a wavefront needs at least 4 cycles to process a single instruction over all of its threads? One SIMD core may hold multiple wavefronts; when is a wavefront switched out?
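Here is the arithmetic behind my 4-cycle guess, written out as a sketch. It assumes each of the 16 SPUs issues the instruction for one thread per cycle and that a quad consists of four consecutive thread IDs. Both are my reading, not statements from the guide.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront (from the guide)
SPUS_PER_SIMD = 16   # SPUs per SIMD core (from the guide)


def cycles_per_instruction(wavefront_size, spus):
    """Cycles to issue one instruction for the whole wavefront.

    Assumption: each SPU handles one thread per cycle.
    """
    return -(-wavefront_size // spus)  # ceiling division


def quad_for_spu(spu, wavefront_size, spus):
    """Thread IDs handled by one SPU.

    Assumption: a quad is a run of consecutive thread IDs.
    """
    quad_size = wavefront_size // spus
    start = spu * quad_size
    return list(range(start, start + quad_size))


if __name__ == "__main__":
    print(cycles_per_instruction(WAVEFRONT_SIZE, SPUS_PER_SIMD))
    print(quad_for_spu(0, WAVEFRONT_SIZE, SPUS_PER_SIMD))
```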
4. About the cache
Is the cache 1D or 2D? Could you say more about the cache behavior and how the memory access pattern affects cache efficiency? Can data in the cache be reused between kernels?
5. About the compute kernel
Appendix C.1.3 says “for best performance by improving memory access, block patterns must be done by the application”. Can you give a programming example of this? The figure is a bit confusing.
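For reference, this is my guess at what the quoted sentence means: in a compute kernel the thread index is linear, so the application itself remaps it so that the 64 threads of one wavefront cover a 2D block instead of a 64x1 strip. The 8x8 block shape and the helper name are my assumptions; a confirmation or correction would help.

```python
# Sketch only: the 8x8 block shape is an assumption, not from the guide.
BLOCK_W = 8  # hypothetical block width in elements
BLOCK_H = 8  # hypothetical block height in elements


def block_coords(thread_id, width):
    """Map a linear thread ID to (x, y) in a 2D domain.

    Each run of BLOCK_W * BLOCK_H consecutive thread IDs covers one
    BLOCK_W x BLOCK_H block. Assumes width is a multiple of BLOCK_W.
    """
    blocks_per_row = width // BLOCK_W
    block_id, local = divmod(thread_id, BLOCK_W * BLOCK_H)
    block_y, block_x = divmod(block_id, blocks_per_row)
    local_y, local_x = divmod(local, BLOCK_W)
    return block_x * BLOCK_W + local_x, block_y * BLOCK_H + local_y


if __name__ == "__main__":
    width = 16
    # The first wavefront (thread IDs 0..63) should stay inside one block.
    coords = [block_coords(t, width) for t in range(64)]
    print(min(coords), max(coords))
```

With a plain linear mapping, the same 64 threads would touch a single row of 64 elements; with this remapping they touch an 8x8 neighborhood, which I assume is friendlier to a tiled layout.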