You can perform full 32-bit integer operations, but only on the T unit. The other 4 units can perform only 24-bit integer operations. Each unit can do 1 single-precision ADD/MUL/MAD.
And yes, there is a huge difference between the 4xxx and 5xxx series.
If you are choosing between them, then definitely choose a 5xxx card.
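To make the 24-bit versus 32-bit distinction concrete, here is a minimal plain C sketch of what a 24-bit multiply means numerically. OpenCL C does provide a `mul24` built-in that uses only the low 24 bits of its operands; whether the compiler maps it onto the four non-T units is my assumption, not something stated in this thread. The helper names `mul24_emulated`, `mul32_full` and `fits_in_24_bits` are made up for illustration, and the emulation is not what the hardware literally does.

```c
#include <stdint.h>

/* Hypothetical helper: emulate an unsigned 24-bit multiply by using only
 * the low 24 bits of each operand and keeping the low 32 bits of the product. */
uint32_t mul24_emulated(uint32_t x, uint32_t y)
{
    uint32_t x24 = x & 0x00FFFFFFu;
    uint32_t y24 = y & 0x00FFFFFFu;
    return (uint32_t)((uint64_t)x24 * (uint64_t)y24);
}

/* Full 32-bit multiply: low 32 bits of the product of the whole operands. */
uint32_t mul32_full(uint32_t x, uint32_t y)
{
    return (uint32_t)((uint64_t)x * (uint64_t)y);
}

/* Returns 1 when both operands fit in 24 bits, i.e. when the 24-bit
 * multiply above gives the same answer as the full 32-bit multiply. */
int fits_in_24_bits(uint32_t x, uint32_t y)
{
    return (x <= 0x00FFFFFFu) && (y <= 0x00FFFFFFu);
}
```

So for index arithmetic where the values are known to stay below 2^24, the 24-bit path gives the same result as the full one; once an operand exceeds that range, the two results diverge.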
The 5XXX architecture is derived from the 4XXX architecture, but highly improved, so there are major similarities. The major difference when it comes to compute performance, apart from more SIMDs, is a high-performance local data share per SIMD. There are also a few new instructions and a more flexible IO system that allows byte-addressable stores to be supported.
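As a rough illustration of why byte-addressable stores matter, the plain C sketch below contrasts writing one byte through a buffer that can only be stored as whole 32-bit words (a read-modify-write) with a direct byte store. It assumes that "not byte addressable" means stores happen at 32-bit granularity and that bytes are laid out little-endian within a word; the function names are hypothetical and this is host-side C, not kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write one byte when only whole 32-bit words can be
 * stored. The containing word is read, patched, and written back. */
void store_byte_via_word(uint32_t *words, size_t byte_index, uint8_t value)
{
    size_t   word_index = byte_index / 4;
    unsigned shift      = (unsigned)(byte_index % 4) * 8u; /* little-endian lanes */
    uint32_t mask       = (uint32_t)0xFFu << shift;
    uint32_t old_word   = words[word_index];

    words[word_index] = (old_word & ~mask) | ((uint32_t)value << shift);
}

/* With byte-addressable stores the same write is a single store. */
void store_byte_direct(uint8_t *bytes, size_t byte_index, uint8_t value)
{
    bytes[byte_index] = value;
}

/* Read back one byte from the word-packed buffer (same layout as above). */
uint8_t load_byte_from_words(const uint32_t *words, size_t byte_index)
{
    return (uint8_t)(words[byte_index / 4] >> ((byte_index % 4) * 8u));
}
```

Note that on a GPU the read-modify-write variant is also unsafe when two work-items write different bytes of the same word at the same time, which is one more reason a direct byte store is convenient.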
Thanks, Nou and Micah. The SIGGRAPH slides helped me a lot. I still wonder why ATI does not publish this kind of information officially.
@Micah: I know that the 5xxx has many improvements over the 4xxx. However, I am still concerned about how cache/local memory is allocated for local and private variables. In particular, has the issue of local arrays being spilled to global memory been solved in the 2.1 SDK on 5xxx? Is there still any memory emulation here?
2.1, and I thought 2.0/2.01 as well, did not push local memory into global memory on 5XXX cards. In 2.1, private memory is now represented by scratch buffers rather than emulated in global memory.
Where are scratch buffers located on 5xxx and 4xxx GPUs? Not in global memory? In shared memory (that is, will register spilling eat into the 16k of shared memory on 5xxx GPUs)?
And what about the shared memory on the 4870? It is not as versatile as the new one on 5xxx, but maybe it could somehow be exposed in OpenCL too? Or is a good piece of fast memory simply lost for good on 4xxx GPUs?
There is currently no plan to expose 4XXX hardware local memory in OpenCL.
Scratch buffers are stored in global memory, but unlike the emulated global memory, they can be optimized away by the CAL compiler in some cases and use register indexing in others.
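To show the kind of code this distinction is about, here is a small plain C sketch of three access patterns on a per-thread array. Which of these patterns the CAL compiler actually turns into register indexing, removes entirely, or leaves in a scratch buffer is not stated in this thread; the comments describe the general idea only, and all function names are hypothetical.

```c
/* Variant A: the array is indexed with a value known only at run time.
 * This is the pattern that may need an indexable private array
 * (register indexing or a scratch buffer). */
float pick_dynamic(float a, float b, float c, float d, int i)
{
    float tmp[4] = { a, b, c, d };
    return tmp[i & 3]; /* index clamped to 0..3 */
}

/* Variant B: the same result written with selects, so no array is needed
 * and every value can live in an ordinary register. */
float pick_select(float a, float b, float c, float d, int i)
{
    int   k = i & 3;
    float r = a;
    r = (k == 1) ? b : r;
    r = (k == 2) ? c : r;
    r = (k == 3) ? d : r;
    return r;
}

/* Variant C: every index is a compile-time constant, so a compiler can
 * replace the array with four scalars and drop the array entirely. */
float sum_constant_indices(float a, float b, float c, float d)
{
    float tmp[4] = { a, b, c, d };
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```

If a kernel's private arrays are small, rewriting Variant A in the style of Variant B or C is one way to avoid depending on how the compiler handles the scratch buffer.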
It's strange to me that private memory is now represented by scratch buffers that are stored in global memory. Even if it is "not emulated" and ATI has a very special strategy for managing this chunk of scratch buffer in global memory (a.k.a. VRAM), it is still much slower than on-chip memory. So what is the reason? From a programming perspective, we expect private memory or scratch buffers to be fast, but they are not allocated in on-chip memory! Why does ATI now have physical on-chip local memory (according to you) but still put private memory in global memory?
I have gone through the ATI 5xxx architecture slides from SIGGRAPH (http://sa09.idav.ucdavis.edu/docs/SA09_AMD_IHV.pdf) that nou recommended, and I came up with a question: slides 16 to 20 talk about two different types of shared memory, Local Shared and Global Shared memory. Why do you need two types of shared memory, and how and where does SDK 2.1 place the shared memory during execution? And does the physical "shared memory" here correspond to the logical "local memory" of the OpenCL spec?
And in summary, could you please specify where the following OpenCL logical memories are PHYSICALLY located on the 5xxx series?