I am learning OpenCL on a PhenomII+Radeon4870 platform and have been following this presentation series:
http://www.macresearch.org/opencl_episode4
In this episode, the presenter does a really good job breaking down the processing elements and their hierarchy for his Nvidia GPU (see slides 7-10 of the PDF, and if you have time the corresponding part of the video is excellent).
Is there any such information available for my GPU? I've looked at the ATI site, but all the information on my card seems to be in terms of graphical primitives (shaders, etc.).
I am especially curious about the "warp" structure (the groups in which all threads execute the exact same code in lock-step), and how many separate units I have that can run different lock-step kernel groups at the same time.
Your GPU has 10 SIMD units, which are 'Compute Units' in OpenCL terms. Each SIMD has 16 TPs (thread processors), which are 'Processing Elements' in OpenCL.
A TP has 5 execution units (ALUs), which can execute 5 different instructions in 1 cycle, but only for a single thread. The shader compiler is responsible for finding 5 independent instructions in the kernel/shader and packing them into a Very Long Instruction Word (VLIW); all the thread processors execute this VLIW instruction group in 1 cycle.
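To put those numbers together, here is a rough back-of-the-envelope sketch in Python. The constants are just the figures quoted above for the 4870; the variable and function names are made up for illustration and do not come from any OpenCL or ATI API.

```python
# Figures quoted in this answer for the Radeon HD 4870 (assumed, not queried
# from the device).
SIMD_UNITS = 10                  # OpenCL "compute units"
THREAD_PROCESSORS_PER_SIMD = 16  # OpenCL "processing elements" per compute unit
ALUS_PER_THREAD_PROCESSOR = 5    # VLIW slots per thread processor


def total_processing_elements():
    """Thread processors across the whole chip."""
    return SIMD_UNITS * THREAD_PROCESSORS_PER_SIMD


def total_alus():
    """Individual ALUs across the whole chip."""
    return total_processing_elements() * ALUS_PER_THREAD_PROCESSOR


if __name__ == "__main__":
    print("processing elements:", total_processing_elements())
    print("ALUs:", total_alus())
```

The ALU total is a peak figure: it is only reached when the compiler manages to fill all 5 VLIW slots with independent instructions.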
A 'wavefront' is equivalent to a 'warp'. It consists of 64 threads, which are executed in order over 4 cycles on the 16 TPs, i.e. threads 0-15 on cycle 1, 16-31 on cycle 2, 32-47 on cycle 3 and 48-63 on cycle 4. (There are actually 2 wavefronts which are executed alternately: 16 threads of wavefront 1 are executed on the 1st cycle, then 16 threads of wavefront 2 on cycle 2, and so on.)
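If it helps, the thread-to-cycle mapping for a single wavefront can be sketched like this. It is only a simplified model of the description above: the function name is hypothetical, and it ignores the interleaving of the two alternating wavefronts.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront, as stated above
LANES = 16           # thread processors per SIMD


def schedule_slot(thread_id):
    """Return (cycle, lane) for a thread within one wavefront.

    cycle is 1-based (1..4) and lane is the thread processor index (0..15).
    Simplified model: one wavefront only, no interleaving with a second one.
    """
    if not 0 <= thread_id < WAVEFRONT_SIZE:
        raise ValueError("thread_id must be in the range 0..63")
    cycle, lane = divmod(thread_id, LANES)
    return cycle + 1, lane


if __name__ == "__main__":
    for tid in (0, 15, 16, 47, 63):
        print(tid, schedule_slot(tid))
```

So thread 16 runs on lane 0 in the second cycle, and thread 63 runs on lane 15 in the fourth.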
I hope that answers your questions. You can read the Stream User Guide for more information on ATI GPUs: developer.amd.com/gpu_assets/Stream_Computing_User_Guide.pdf
Info for the 7XX architecture at:
http://ixbtlabs.com/articles3/video/spravka-r7xx-p1.html