Problems about wavefront, global buffer, cache and compute kernel

I read the Stream Computing User Guide 1.4 and have some questions. I would appreciate some help.


1. About the memory tiling

This section says that an ATI stream processor can use a tiled or a linear memory arrangement: “The tiled layout format has a pre-defined sequence of element blocks arranged in sequential memory addresses”. When does the tiling happen? Is it handled automatically by the driver? For a regular data stream defined in a Brook+ program, does the tiling happen between “stream read” and “kernel invoke”? What does “element blocks” mean? Is the memory tiled per element, or per block that matches the size of a wavefront?


2. About the global buffer

Is every read from or write to the global buffer uncached? Does that mean that gather and scatter streams in Brook+ programs are all uncached?


3. About the wavefront

For R770, one SIMD core has 16 SPUs, and a wavefront has 64 threads. How, then, can it be that “a wavefront processes a single instruction over all of the threads at the same time”? Each SPU gets a quad (2x2) of threads from the wavefront, and the 4 threads in the quad are switched in a block multithreaded manner, right? Does that mean the wavefront needs at least 4 cycles to process a single instruction over all of its threads? One SIMD core may hold multiple wavefronts; when is a wavefront switched out?
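
The arithmetic behind the 4-cycle guess can be written out as a small sketch. It uses only the two figures quoted above and assumes each SPU issues one thread's instruction per cycle; that issue rate and the function name are assumptions for illustration, not something the guide excerpt confirms.

```python
import math

def min_cycles_per_instruction(wavefront_size, spus_per_simd):
    """Lower bound on cycles for one instruction to cover a whole wavefront,
    assuming (hypothetically) one thread per SPU per cycle."""
    return math.ceil(wavefront_size / spus_per_simd)

# Figures from the question: 64 threads per wavefront, 16 SPUs per SIMD core.
threads_per_spu = 64 // 16                    # the "quad" each SPU handles
cycles = min_cycles_per_instruction(64, 16)
```

Under that assumption, both `threads_per_spu` and `cycles` come out to 4, which is where the "at least 4 cycles" figure in the question comes from.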


4. About the cache

Is the cache 1D or 2D? Please explain the cache behavior in more detail, and how the memory access pattern affects cache efficiency. Can data in the cache be reused between kernels?


5. About compute kernel

Appendix C.1.3 says “for best performance by improving memory access, block patterns must be done by the application”. Can you give a programming example of this? The figure is a bit confusing.

2 Replies

1) In Brook+, tiling is done implicitly by the runtime on all streams that are not accessed via scatter/gather. A tiling block is 8x8 elements and matches the wavefront.
2) Yes, this is correct. Global buffer access is currently done as an uncached read/write.
3) Please see the thread with ID 115872, 'calculating the bottleneck'
4) This has been a longstanding request and we are working with the documentation folks on this.
5) Please look at lds_transpose.
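
As a rough illustration of points 1 and 5, the sketch below compares a linear (row-major) layout with a layout made of contiguous 8x8 element tiles, and shows one way an application could remap a linear thread ID onto an 8x8 block pattern. The ordering inside a tile, the function names, and the requirement that the width be a multiple of 8 are all assumptions for illustration; the actual hardware tile format is not described in this thread.

```python
TILE = 8  # tile edge in elements; 8x8 = 64 elements, one per wavefront thread

def linear_offset(x, y, width):
    """Element offset in a plain row-major (linear) layout."""
    return y * width + x

def tiled_offset(x, y, width):
    """Element offset when each 8x8 tile occupies 64 consecutive addresses.
    Assumes width is a multiple of TILE and row-major order inside a tile
    (a hypothetical ordering, chosen only to keep the sketch simple)."""
    tiles_per_row = width // TILE
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    within_tile = (y % TILE) * TILE + (x % TILE)
    return tile_index * TILE * TILE + within_tile

def thread_to_block_xy(tid, width):
    """Remap a linear thread ID to (x, y) so that every run of 64
    consecutive IDs covers one 8x8 block instead of a 64x1 strip."""
    tiles_per_row = width // TILE
    tile_index, within_tile = divmod(tid, TILE * TILE)
    tile_y, tile_x = divmod(tile_index, tiles_per_row)
    y_in, x_in = divmod(within_tile, TILE)
    return tile_x * TILE + x_in, tile_y * TILE + y_in
```

With a width of 16, threads 0 to 63 map onto the block with corners (0, 0) and (7, 7), and thread 64 starts the next block at (8, 0). In this sketch `thread_to_block_xy` is the inverse of `tiled_offset`, so 64 consecutive threads touch 64 consecutive tiled addresses.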

On the first question: as you said, the tiling is done by the runtime. If an application accesses the same stream in both gather mode and normal mode, how does the runtime handle that?