Hi folks,
I am planning to buy an ATI 5870 for my OpenCL work, and I may need your help. Strangely enough, I could not find any official detailed technical specification for this card from ATI/AMD. I am looking for specs such as how much constant cache, L2 cache, shared memory, and private memory it has per core. Is there any improvement over the 48xx series in terms of resources per core, or is it about the same? Can you point me to a source of detailed architecture documents for this card and for other cards? In general, where does ATI publish them?
Additionally, I read some reviews saying that if you work with a vector size of 4 you cannot get full 32-bit operations. E.g., Tom's Hardware said: "Now the four cores are capable of performing a multiplication or addition per cycle, but only on 24-bit integers". So what is the truth? If Tom's is right, how does the 5870 handle operations on the int4 data type?
Last but not least, is there much difference in architecture between the 5870 and the 4890? I am spending a large amount of money, so I just want to be sure I pick the right one.
Thank you very much,
Roto
http://sa09.idav.ucdavis.edu/docs/SA09_AMD_IHV.pdf
You can perform full 32-bit integer operations, but only on the T unit. The other 4 units can only perform 24-bit integer operations. Each unit can do 1 SP ADD/MUL/MAD.
And yes, there is a huge difference between the 4xxx and 5xxx series.
If you are choosing between them, then definitely choose a 5xxx card.
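To make the 24-bit versus 32-bit point concrete, here is a minimal sketch in plain C (not OpenCL C, and not a description of how the hardware or the compiler actually implements it). It assumes a 24-bit unit only sees the low 24 bits of each operand; `mul24_emulated`, `mul32_full`, and `fits_in_24_bits` are hypothetical helper names made up for illustration. The idea is just that the two multiplies agree whenever both operands fit in 24 bits, and can differ otherwise.

```c
#include <stdint.h>

/* Hypothetical helper (assumption, not an AMD API): keep only the low
   24 bits of each operand, multiply, and truncate the product to 32 bits. */
uint32_t mul24_emulated(uint32_t x, uint32_t y)
{
    uint64_t product = (uint64_t)(x & 0xFFFFFFu) * (uint64_t)(y & 0xFFFFFFu);
    return (uint32_t)product;
}

/* Ordinary 32-bit unsigned multiply, wrapping modulo 2^32. */
uint32_t mul32_full(uint32_t x, uint32_t y)
{
    return x * y;
}

/* Returns 1 when both operands fit in 24 bits, so that the 24-bit
   multiply above gives the same result as the 32-bit one; 0 otherwise. */
int fits_in_24_bits(uint32_t x, uint32_t y)
{
    return (x <= 0xFFFFFFu) && (y <= 0xFFFFFFu);
}
```

For example, 1000 * 2000 gives 2000000 either way, while with an operand of 0x1000000 (which needs 25 bits) the 24-bit version sees zero and the two results differ.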
Thanks, Nou and Micah. The SIGGRAPH slides helped me a lot. I still wonder why ATI does not make this kind of information officially public.
@Micah: I know that the 5xxx series has many improvements over the 4xxx series. However, I am still concerned about how cache/local memory is allocated for local and private variables. In particular, the issue of local arrays being spilled to global memory: has it been solved in the 2.1 SDK on 5xxx? Is any memory still being emulated?
Thanks,
Rt
Hi Micah,
It seems strange to me that private memory is now represented by the scratch buffer, which is stored in global memory. Even if it is "not emulated" and ATI has a very special strategy for managing this chunk of scratch buffer in global memory (a.k.a. VRAM), it is still much slower than on-chip memory. So what is the reason? From a programming perspective we expect private memory, or the scratch buffer, to be fast, but it is not allocated in on-chip memory! Why does ATI now have physical on-chip local memory (according to you) but still put private memory in global memory?
I have gone through ATI's 5xxx architecture slides from SIGGRAPH (http://sa09.idav.ucdavis.edu/docs/SA09_AMD_IHV.pdf) that nou recommended, and I have a question: in slides 16 to 20, ATI talks about two different types of shared memory, Local Shared and Global Shared memory. Why do you need two types of shared memory, and at execution time, how and where does SDK 2.1 place the shared memory? And does the physical "shared memory" here correspond to the logical "local memory" of the OpenCL spec?
And in summary, could you please specify where the following OpenCL logical memories are PHYSICALLY located on the 5xxx series?
-Local memory
-Constant memory
-Private memory
Many thanks,
Roto