I'm looking at some results I got from running the same kernel in pixel shader and compute shader modes.
For almost the same code, the compute shader runs much slower when ALU is not the bottleneck (and even when ALU is the bottleneck, it's still a little slower). Does anyone have any ideas why this is the case? There is only 1 output, and I'm only using the global buffer for that output; the inputs still use texture fetches.
For example, with 12 inputs, an ALU:Fetch ratio of 4.0 (according to the SKA equation, which is actually 16.0), 1 output, and 5000 iterations, I get the following times:
CS: 18.639 (ALU is not the bottleneck here and I'm not sure what is; it would appear to be memory, but there is only 1 output and the inputs are texture fetches)
PS: 11.5 (ALU IS the bottleneck here)
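To put those numbers side by side, here is a rough sketch of the arithmetic in Python. It assumes the factor of 4 between the SKA-reported ratio and the raw ratio (implied by 4.0 vs. 16.0 above) and assumes each of the 12 inputs is fetched exactly once; the function names are just illustrative, and the timings are in whatever unit the run reported.

```python
# Figures quoted in the post (timing units are not stated).
CS_TIME = 18.639
PS_TIME = 11.5
NUM_INPUTS = 12
SKA_RATIO = 4.0

# Assumption: SKA's reported ALU:Fetch ratio is the raw ratio divided by 4,
# which is what "4.0 ... actually 16.0" implies.
SKA_NORMALIZATION = 4.0


def raw_alu_fetch_ratio(ska_ratio, normalization=SKA_NORMALIZATION):
    """Convert the SKA-reported ratio to raw ALU ops per fetch."""
    return ska_ratio * normalization


def alu_ops(raw_ratio, fetches):
    """Estimate ALU ops in the kernel, assuming one fetch per input."""
    return raw_ratio * fetches


def slowdown(cs_time, ps_time):
    """How many times slower the compute shader is than the pixel shader."""
    return cs_time / ps_time


raw_ratio = raw_alu_fetch_ratio(SKA_RATIO)
print("raw ALU:Fetch ratio:", raw_ratio)
print("estimated ALU ops:", alu_ops(raw_ratio, NUM_INPUTS))
print("CS/PS slowdown: %.2fx" % slowdown(CS_TIME, PS_TIME))
```

So the compute shader is roughly 1.6x slower here for the same estimated ALU work.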
Does anyone have an idea why there is such a big difference?
EDIT: Forgot to mention: this was run on an HD4870, with no branching, no data reuse, and float4 data types. The code is almost exactly the same for both kernels (minus the domain calculations for the compute shader).