I'm currently working on a scan algorithm for the clpp library, so, it is done but I have discover that the one in the AMD SDK Sample is faster on the CPU... up to 3 times (I have also test it on a NVidia GPU but I got the same performance between the 2 scans) !!!
I have based my code on the following article :
Unfortunately I have found no explanation of the AMD algorithm ! For sure there are some differences with my implementation but I would like to know what difference boost the performance ?
So, I have read the AMD code for the "large scan" sample found in the AMD SDK and I have a few questions.
I have see several differences like :
1 - when calling bScan you use the following values :
blockSize = GROUP_SIZE (256)
global-threads = values_count / 2
local-threads = blockSize / 2 = 128
I'm surprised that you use a fixed size for the block
In my version, I use local-threads = 1024/2 because a workgroup has a size of 1024 on the CPU (I also add ).
2 - You use 2 scan versions, but the last scan is a small one so I m not convinced that it change a lot
3 - You do the uniform-addition in one pass !
4 - In you "ScanLargeArrays" kernel you do this :
for(int d = block_size>>1; d > 0; d >>=1) // block_size>>1 = 128
So, in my version I do
for(uint d = get_local_size(0); d > 0; d >>= 1) // get_local_size(0) = 512
So, I do 2 additionial iterations than in the AMD code !
If you want to see my code, here it is :http://code.google.com/p/clpp/