Archives Discussions

spectral · ‎06-01-2011

Hi,

I'm currently working on a scan algorithm for the clpp library, so, it is done but I have discover that the one in the AMD SDK Sample is faster on the CPU... up to 3 times (I have also test it on a NVidia GPU but I got the same performance between the 2 scans) !!!

I have based my code on the following article :

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf

Unfortunately I have found no explanation of the AMD algorithm ! For sure there are some differences with my implementation but I would like to know what difference boost the performance ?

So, I have read the AMD code for the "large scan" sample found in the AMD SDK and I have a few questions.

I have see several differences like :

1 - when calling bScan you use the following values :

blockSize = GROUP_SIZE (256)

global-threads = values_count / 2

local-threads = blockSize / 2 = 128

I'm surprised that you use a fixed size for the block

In my version, I use local-threads = 1024/2 because a workgroup has a size of 1024 on the CPU (I also add ).

2 - You use 2 scan versions, but the last scan is a small one so I m not convinced that it change a lot

3 - You do the uniform-addition in one pass !

4 - In you "ScanLargeArrays" kernel you do this :

for(int d = block_size>>1; d > 0; d >>=1) // block_size>>1 = 128

So, in my version I do

for(uint d = get_local_size(0); d > 0; d >>= 1) // get_local_size(0) = 512

So, I do 2 additionial iterations than in the AMD code !

If you want to see my code, here it is :http://code.google.com/p/clpp/

spectral · ‎06-01-2011

After a few review, I notice that my code is approximatively the same than the one from AMD, except that you have a fixed size for the BLOCK (256 values) !

So, I have simply try to fix this value in my code... so now I have the same number of pass etc... but still have poor perf on the CPU !

LeeHowes · ‎06-01-2011

The CPU only has a little vector. Don't do a parallel scan like that on the CPU, it'll always be inefficient. Just do:

float sum;

For( int i = 0; i < block_size; ++i )

sum += input;

With a workgroup size of 1, issue NumCores work groups and do a trivial scan across the NumCores workgroups at the end. Then do a second pass to all but the first group to add those values in:

For( int i = 0; i < block_size; ++i )

sum += input;

output = sum + sumToEndOfPreviousGroup;

For the GPU you might as well fix the block size there's no reason not to. You'll have to vectorise the above code to make it work on the GPU, of course. You could do a group scan in the way the SDK sample does, but i don't recall that being a very efficient implementation (remember, work efficient scans == bad idea on GPU because you always have a vector you might as well not mask it).

You could do a first pass as a completely vector reduction over 1/(NumCores*WavesPerCore) chunk of the array and store those intermediate reduction values in the same way as the CPU. Then do a second pass that computes all the intermediate sums and adds that value on. I don't think you want to be outputting intermediate prefix sum data on a per-value basis, only on a per workgroup basis.

Archives Discussions

Scan algorithm