Hi, understanding memory-related performance aspects is important but sometimes a bit tricky. They also change from architecture to architecture.
Here are a couple of questions (relating to the MemoryOptimization benchmark in the SDK):...
My setup: W5000 (Pitcairn), OpenCL 1.2 AMD-APP (1124.2)
According to my understanding:
1. The Copy 1D fast path does not have any conditions in it, so all the work-items just perform the copy instruction. In the Copy 1D complete path, every work-item has to check the condition gid < 0 (which is never true, since there are no gids < 0) and only then performs the copy operation. That is why it is slower.
2. It is not exactly 2 or 4 times faster than a single float. That again depends on the logic as well.
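
To make point 1 concrete, here is a minimal sketch in plain C that emulates the two copy variants one work-item at a time. It is not the SDK kernel source: the names copy_fast, copy_complete, run_copies and guard_hits are hypothetical, and the counter increment is only a placeholder for whatever the real kernel does inside its gid < 0 branch. It only shows that both variants produce the same output and that the guarded branch is never taken for valid work-item IDs; it says nothing about actual timings on the GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one work-item of the "fast path" copy:
 * no condition, just one load and one store. */
static void copy_fast(int gid, const float *in, float *out)
{
    out[gid] = in[gid];
}

/* Hypothetical stand-in for one work-item of the "complete path" copy:
 * every work-item evaluates gid < 0 before copying. The body of the
 * branch is a placeholder (assumption), counted via guard_hits. */
static void copy_complete(int gid, const float *in, float *out, int *guard_hits)
{
    if (gid < 0) {
        *guard_hits += 1; /* never reached for valid global IDs */
    }
    out[gid] = in[gid];
}

/* Runs n emulated work-items sequentially through both variants and
 * returns how many times the gid < 0 branch was taken. */
static int run_copies(const float *in, float *out_fast, float *out_complete, int n)
{
    int guard_hits = 0;

    assert(n >= 0);
    for (int gid = 0; gid < n; ++gid) {
        copy_fast(gid, in, out_fast);
        copy_complete(gid, in, out_complete, &guard_hits);
    }
    return guard_hits;
}
```

Calling run_copies on any input buffer should return 0 and leave both output buffers identical to the input, so any difference between the two paths on the device comes from how the kernel is compiled and executed, not from the result it computes.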