Documentation: suggestions and doubts

Apr 23, 2008

I have had some trouble understanding the RV670 architecture.
From the documentation, forum posts, and general web browsing, this is what I understand:

a) The Radeon 38x0 is organized as a DPP of 16*4, i.e. 4 rows of 16 units each. Each of these 64 units in turn has 5 stream processors, one of which is dedicated to single-precision transcendentals. When doing double precision, all 5 are used together, so MADD performance drops to roughly 1/5, while the slowdown for transcendentals varies.
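To check the arithmetic in (a), here is a back-of-the-envelope sketch. It only multiplies out the numbers stated above. It assumes, as the post does, that in single precision every one of the 5 lanes can issue a MADD per clock and that in double precision one whole unit produces one MADD. Neither point is confirmed by vendor figures here.

```python
# Back-of-the-envelope model of the Radeon 38x0 layout described in (a).
# All figures are the poster's understanding, not vendor specifications.

ROWS = 4              # rows in the DPP
UNITS_PER_ROW = 16    # units per row
LANES_PER_UNIT = 5    # stream processors per unit (one handles transcendentals)

units = ROWS * UNITS_PER_ROW                 # 64 "processors"
stream_processors = units * LANES_PER_UNIT   # 320 stream processors

# Assumption from the post: in single precision each lane issues one MADD per clock.
sp_madd_per_clock = stream_processors
# Assumption from the post: in double precision all 5 lanes of a unit form one MADD.
dp_madd_per_clock = units

print(units, stream_processors)
print(dp_madd_per_clock / sp_madd_per_clock)  # ratio of DP to SP MADD throughput
```

Under these assumptions the ratio comes out to exactly 1/5, which matches the "roughly 1/5th for MADD" figure.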

b) Question: What is the relationship between the 16 units in a row? Do they operate in SIMD fashion? How independent are the 64 units? Can they branch independently? If I understand correctly, when I set up a domain and launch a kernel, it is distributed among these 64 processors in some fashion, not split into 320 pieces.
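To make the branching question concrete, here is a toy model of what "operating in SIMD fashion" would imply. It assumes, hypothetically, that the 16 units of a row share one instruction stream and handle a divergent branch by running both paths under a per-lane mask. That assumption is exactly what question (b) asks about, so treat this as an illustration of the concept, not a description of the hardware. `run_simd_branch` and `SIMD_WIDTH` are made-up names.

```python
# Toy model of a SIMD group hitting a divergent branch.
# HYPOTHETICAL: the 16 units of a row run in lockstep; this is the
# question being asked in (b), not a confirmed hardware property.

SIMD_WIDTH = 16

def run_simd_branch(values):
    """Apply `x * 2 if x is even else x + 1` across one SIMD group.

    Returns (results, passes), where `passes` counts how many branch
    paths the whole group had to step through.
    """
    assert len(values) <= SIMD_WIDTH
    mask = [v % 2 == 0 for v in values]   # per-lane branch condition
    results = list(values)
    passes = 0
    if any(mask):                          # 'then' path, if any lane takes it
        passes += 1
        for i, take in enumerate(mask):
            if take:
                results[i] = values[i] * 2
    if not all(mask):                      # 'else' path for the remaining lanes
        passes += 1
        for i, take in enumerate(mask):
            if not take:
                results[i] = values[i] + 1
    return results, passes

uniform = run_simd_branch([0, 2, 4, 6])    # every lane takes the same path
divergent = run_simd_branch([0, 1, 2, 3])  # lanes disagree
print(uniform)    # ([0, 4, 8, 12], 1)
print(divergent)  # ([0, 2, 4, 4], 2)
```

If the units really work like this, a divergent branch costs the sum of both paths for the whole group, while fully independent units would each pay only for the path they take.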

c) Syncing: I don't think there is a sync instruction in AMD IL, is there? I mean some kind of global barrier.
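To pin down what "global barrier" means in (c), here is a CPU analogy using Python's standard `threading.Barrier`. No worker may start phase 2 until every worker has finished phase 1. This only illustrates the semantics being asked about and says nothing about whether AMD IL provides an equivalent. The worker count and phase arrays are arbitrary choices for the example.

```python
# CPU analogy for a global barrier, using Python's threading.Barrier.
# Illustrates the semantics only; not a claim about AMD IL.
import threading

N_WORKERS = 4
barrier = threading.Barrier(N_WORKERS)
phase1 = [0] * N_WORKERS
phase2 = [0] * N_WORKERS

def worker(i):
    phase1[i] = i + 1          # phase 1: each worker writes only its own slot
    barrier.wait()             # block until all workers have finished phase 1
    phase2[i] = sum(phase1)    # phase 2: now safe to read every worker's slot

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(phase2)  # [10, 10, 10, 10]
```

Without the `barrier.wait()` call, a fast worker could sum `phase1` before the others had written their slots.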

d) Cache: Does each of these 64 units have a cache? Are the caches independent or shared? How big are they?

e) Global memory can be read and written by all processors.
Edit: Global memory operations are probably not synchronized, so is it a bad idea to write to the same memory location from multiple processors?
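The hazard behind (e) can be shown with a deterministic replay of a "lost update." Two processors each do an unsynchronized read-modify-write on the same word. The interleaving is picked by hand to show the bad case, and `read`/`write` are hypothetical stand-ins for global memory accesses.

```python
# Deterministic replay of a lost update on a shared memory word.
# `read` and `write` are hypothetical stand-ins for global memory access;
# the interleaving below is chosen by hand to show the bad case.

memory = {"counter": 0}

def read(addr):
    return memory[addr]

def write(addr, value):
    memory[addr] = value

# Processors A and B each intend to add 1 to the same location.
a_val = read("counter")      # A reads 0
b_val = read("counter")      # B also reads 0, before A has written
write("counter", a_val + 1)  # A writes 1
write("counter", b_val + 1)  # B writes 1, overwriting A's update

print(memory["counter"])     # 1, not 2
```

Giving each processor its own output location and combining the results afterwards avoids this particular hazard, whatever synchronization the hardware does or does not offer.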

It would be great if the Programming Guide had a brief paragraph explaining these concepts.