Scan algorithm

Discussion created by spectral on Jun 1, 2011
Latest reply on Jun 1, 2011 by LeeHowes


I'm currently working on a scan algorithm for the clpp library, so, it is done but I have discover that the one in the AMD SDK Sample is faster on the CPU... up to 3 times (I have also test it on a NVidia GPU but I got the same performance between the 2 scans) !!!


I have based my code on the following article :



Unfortunately I have found no explanation of the AMD algorithm ! For sure there are some differences with my implementation but I would like to know what difference boost the performance ?

So, I have read the AMD code for the "large scan" sample found in the AMD SDK and I have a few questions.

I have see several differences like :

1 - when calling bScan you use the following values :

blockSize = GROUP_SIZE (256)

global-threads  = values_count / 2

local-threads  = blockSize / 2 = 128

I'm surprised that you use a fixed size for the block

In my version, I use local-threads = 1024/2 because a workgroup has a size of 1024 on the CPU (I also add ).

2 - You use 2 scan versions, but the last scan is a small one so I m not convinced that it change a lot

3 - You do the uniform-addition in one pass !

4 - In you "ScanLargeArrays" kernel you do this :

for(int d = block_size>>1; d > 0; d >>=1)  // block_size>>1 = 128

So, in my version I do 

for(uint d = get_local_size(0); d > 0; d >>= 1) // get_local_size(0) = 512

So, I do 2 additionial iterations than in the AMD code !

If you want to see my code, here it is :http://code.google.com/p/clpp/