Ok, sorry about the "Horrible 5870 performance" but this goes to the same topic...
... why is the 64x1 block size performance so horrid?
Compute shader mode might be faster, but you really need to know how to get an ideal texture fetch pattern for that to happen.
Accessing naively (64x1) gives HORRIBLE performance... WAY worse than pixel shader mode. And if LDS isn't any faster... I mean how many applications out there really need LDS?
What's more curious to me is that the 4870 runs twice as fast as the 5870 when accessing that block size in compute shader mode.
Also, for the same number of inputs in pixel shader mode and an increasing ALU:Fetch ratio, the 4870 changes from texture bound to ALU bound at a lower ALU:Fetch ratio than the 5870, though the overall execution is still faster on the 5870.
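To make the texture-bound vs. ALU-bound point concrete, here is a rough toy model in Python. It is only a sketch: it assumes fetch and ALU work overlap perfectly, so whichever is slower sets the run time. The function names and the per-operation costs for "card_a" and "card_b" are made up for illustration and are not measurements of the 4870 or 5870.

```python
# Toy model of the ALU:Fetch crossover. All costs below are hypothetical.

def kernel_time(num_fetches, alu_fetch_ratio, fetch_cost, alu_cost):
    """Run time if fetch and ALU work overlap perfectly (slower one wins)."""
    fetch_time = num_fetches * fetch_cost
    alu_time = num_fetches * alu_fetch_ratio * alu_cost
    return max(fetch_time, alu_time)

def crossover_ratio(fetch_cost, alu_cost):
    """ALU:Fetch ratio at which the kernel stops being fetch bound."""
    return fetch_cost / alu_cost

# Hypothetical cards: (cost per fetch, cost per ALU op), arbitrary time units.
card_a = (4.0, 1.0)   # slower fetch, slower ALU
card_b = (3.0, 0.5)   # faster fetch, much faster ALU

for ratio in (1, 2, 4, 6, 8):
    t_a = kernel_time(1000, ratio, *card_a)
    t_b = kernel_time(1000, ratio, *card_b)
    print(f"ratio {ratio}: card_a {t_a:.0f}, card_b {t_b:.0f}")

print("crossover card_a:", crossover_ratio(*card_a))
print("crossover card_b:", crossover_ratio(*card_b))
```

In this model a card whose ALU speed improved more than its fetch speed stays fetch bound up to a higher ratio while still being faster at every ratio, which is at least consistent with the behaviour described above.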
Micah,
In compute shader mode with a 64x1 block size, the 5870 runs at half the speed of the 4870 on a very simple benchmark. It would be great if you could verify this. The kernel code for all my benchmarks stays the same when I run it across the cards.
I do have several benchmarks that test different parameters and aspects of the newer generation cards; however, at the moment I'm only using a 64x1 block size. I plan to try an 8x8 block size shortly if I have time, but I'm on a deadline.
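As a rough illustration of why block shape could matter for texture fetch, the Python sketch below counts how many 2D tiles a single 64-thread wavefront touches for different block shapes. The 4x4 tile size and the thread-to-texel mapping are assumptions chosen for illustration, not documented properties of the 4870 or 5870 texture cache.

```python
# Toy model: texture footprint of one 64-thread wavefront per block shape.
# TILE_W and TILE_H are hypothetical cache-tile dimensions.

WAVEFRONT_SIZE = 64
TILE_W, TILE_H = 4, 4  # assumed tile size, for illustration only

def wavefront_texels(block_w, block_h):
    """Texel (x, y) read by each thread of the first wavefront, assuming
    thread i of the block reads texel (i % block_w, i // block_w)."""
    assert block_w * block_h == WAVEFRONT_SIZE
    return [(i % block_w, i // block_w) for i in range(WAVEFRONT_SIZE)]

def tiles_touched(block_w, block_h):
    """Number of distinct TILE_W x TILE_H tiles the wavefront covers."""
    texels = wavefront_texels(block_w, block_h)
    return len({(x // TILE_W, y // TILE_H) for x, y in texels})

for shape in [(64, 1), (4, 16), (8, 8)]:
    print(shape, "->", tiles_touched(*shape), "tiles")
```

Under these assumptions the 64x1 block spreads its 64 reads over four times as many tiles as the 4x16 or 8x8 blocks, so each fetched tile is used less.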
Micah,
I just have two more questions:
1. It doesn't seem that streaming store and global write are any different? Is that true?
2. I get better performance on the 5870 using streaming store and global read than I do using streaming store and texture fetch in pixel shader mode for float4 data types. Does that sound right? What I mean is that it appears for pixel shader mode texture fetch stays the bottleneck for even very high ALU:Fetch ratios, while this is not true if I use Global Read.
Originally posted by: ryta1203
Micah,
I just have two more questions:
1. It doesn't seem that streaming store and global write are any different? Is that true?
2. I get better performance on the 5870 using streaming store and global read than I do using streaming store and texture fetch in pixel shader mode for float4 data types. Does that sound right? What I mean is that it appears for pixel shader mode texture fetch stays the bottleneck for even very high ALU:Fetch ratios, while this is not true if I use Global Read.
1. Would still like an answer.
2. What I mean is, for float4 on 5870 (greater is better)
4x16 > 64x1 block size in compute shader mode
global read in pixel shader mode > texture fetch in pixel shader mode
4x16 in compute shader mode ~= global read in compute shader mode
Do these comparisons look accurate? It still seems that pixel shader mode is faster, regardless, even though ATI claims that theoretically it shouldn't be.
However, for the 4870, I found that 4x16 texture fetch was much better than global read. On the 4870, I found that global read was the same for float, float4, pixel and compute (any combo thereof); however, on the 5870, it seems pixel was a little faster than compute. I have a lot of results, and so a lot of questions, but I don't have the time to share them all to see if the results are accurate or not.