I have had some trouble understanding the R670 architecture. From the documentation, forum posts and random web browsing, this is what I understood:
a) Radeon 38x0 is organized as a DPP of 16*4, i.e. 4 rows of 16 units each. Each of these 16 units in turn has 5 stream processors, 1 of which is dedicated to single-precision transcendentals. When doing double precision, all 5 are used, giving roughly 1/5th of the single-precision performance for MADD and a variable fraction for transcendentals.
b) Question: What is the relationship between the 16 units? Do they operate in SIMD fashion? How independent are the 64 units? Can they branch independently? If I understand correctly, when I set up a domain and launch a kernel, it is distributed among these 64 processors in some fashion, and not in 320 pieces.
c) Syncing: I don't think there is a sync instruction in AMD IL? I mean some kind of global barrier.
d) Cache: Does each of these 64 units have a cache? Are the caches independent or shared? How big is the cache?
e) Global memory can be read/written by all processors. Edit: Global memory operations are probably not synchronized, so it's not a good idea to write the same memory location from multiple processors?
It would be great if you could include a brief paragraph in the Programming Guide explaining these concepts.
a) Just clarifying terminology (otherwise we'll all start talking in slightly different terminology and get confused... :-)). The FireStream 9170, Radeon HD 3870 and FireGL V7700 are all stream processors using the RV670 GPU. The RV670 has 4 SIMD arrays. Each of those SIMD arrays has 16 thread processors. Each of those thread processors consumes 5-wide VLIW instructions. Each thread processor has 5 stream cores used to process the 5 instructions in the VLIW instruction. Those cores are labeled x, y, z, w and t. All can do int and SPFP. t can do transcendentals also. And DPFP is performed by fusing together x, y, z and w to perform a single DPFP op.
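To make the counts in (a) concrete, here is a small Python sketch of the hierarchy. The constant names are made up for illustration. The double-precision line assumes that a fused x/y/z/w op issues at the same rate as a single-precision op, which is not stated above, so treat the ratio as a rough peak figure only.

```python
# Hierarchy as described in (a); all names here are illustrative, not AMD terms.
SIMD_ARRAYS = 4
THREAD_PROCESSORS_PER_SIMD = 16
STREAM_CORES = ("x", "y", "z", "w", "t")   # t also handles transcendentals
DP_FUSED_CORES = ("x", "y", "z", "w")      # fused together for one DPFP op

# Total thread processors (the "64 units" in the question).
thread_processors = SIMD_ARRAYS * THREAD_PROCESSORS_PER_SIMD

# Total stream cores (the "320" in the question).
stream_cores = thread_processors * len(STREAM_CORES)

# Peak ops issued per step if every VLIW slot is filled (single precision).
sp_ops_per_issue = stream_cores

# ASSUMPTION: one fused DPFP op per thread processor per step.
dp_ops_per_issue = thread_processors

dp_to_sp_ratio = dp_ops_per_issue / sp_ops_per_issue

print("thread processors:", thread_processors)
print("stream cores:", stream_cores)
print("peak DP/SP op ratio (under the assumption above):", dp_to_sp_ratio)
```

Under that assumption the ratio works out to one DPFP op for every five SPFP slots, which lines up with the "roughly 1/5th" figure in the question.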
b) All the thread processors in a SIMD array must be running the same instruction on a particular clock cycle. Different SIMD arrays can be running different instructions. However, this level of control is not available directly to the user and is handled by the thread dispatcher inside of the GPU.
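As a rough illustration of what "same instruction on a particular clock cycle" means for branching, here is a toy Python model of one SIMD array. It assumes divergent branches are handled by running each taken path in turn with the non-participating lanes masked off. That is a common approach on SIMD hardware in general, but it is an assumption here and not a description of what the thread dispatcher actually does. The function name and the "issued" counter are made up for the example.

```python
def run_simd_branch(values, predicate, then_op, else_op):
    """Toy lockstep model: every lane sees the same instruction stream.

    Returns the per-lane results and how many branch paths had to be issued.
    ASSUMPTION: divergence is handled by masking, not by independent lanes.
    """
    lanes = list(values)
    mask = [predicate(v) for v in lanes]
    issued = 0

    # "then" path: issued only if at least one lane needs it.
    if any(mask):
        issued += 1
        lanes = [then_op(v) if m else v for v, m in zip(lanes, mask)]

    # "else" path: issued only if at least one lane needs it.
    if not all(mask):
        issued += 1
        lanes = [v if m else else_op(v) for v, m in zip(lanes, mask)]

    return lanes, issued


# 16 lanes, matching the 16 thread processors in one SIMD array.
inputs = range(16)

# All lanes agree on the branch: only one path is issued.
uniform, uniform_paths = run_simd_branch(
    inputs, lambda v: v >= 0, lambda v: v * 2, lambda v: v + 100)

# Lanes disagree: both paths are issued, each with some lanes masked off.
divergent, divergent_paths = run_simd_branch(
    inputs, lambda v: v % 2 == 0, lambda v: v * 2, lambda v: v + 100)

print("paths issued (uniform):", uniform_paths)
print("paths issued (divergent):", divergent_paths)
```

In this model a branch is only cheap when all 16 lanes of an array agree on it. Different SIMD arrays can still take different paths without paying for each other.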
c) There isn't a sync instruction yet. This is a feature which may show up in the next few generations of GPUs.
d) I believe the caches are shared by the texture units. Unfortunately, I don't actually know the exact cache size on the current GPUs.
e) You should not count on synchronization between multiple stream cores at this time.
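One practical consequence is to arrange kernels so that each output location is written by exactly one thread. The sketch below is plain Python, not IL or Brook+, and the helper name is made up. It only checks a proposed thread-to-address mapping for locations that more than one thread would write.

```python
from collections import Counter


def conflicting_addresses(addresses):
    """Return the sorted output addresses written by more than one thread.

    `addresses[i]` is the location that thread i would write to.
    """
    counts = Counter(addresses)
    return sorted(addr for addr, n in counts.items() if n > 1)


domain_size = 16

# One thread per output element: thread i writes location i only.
one_to_one = [i for i in range(domain_size)]

# Histogram-style scatter: many threads write into the same few bins.
scatter = [i % 4 for i in range(domain_size)]

print("conflicts (one-to-one):", conflicting_addresses(one_to_one))
print("conflicts (scatter):", conflicting_addresses(scatter))
```

With no synchronization to rely on, a mapping like the second one would leave the final contents of those locations dependent on execution order, so it is safer to restructure it so that each output is gathered by a single thread.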
We are actually close to releasing a technical overview (we need to proofread it with some of the engineers here and the legal department to make sure we didn't leak inappropriate information. :-)).