cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

is2013
Journeyman III

ScanLargeArrays Sample in AMD SDK

Hi all,

I am computing prefix sum using ScanLargeArrays sample in SDK of a large array containing 1s and 0s lets say {1,0,0,1,1,0,0,0,0,0,1,1,0,1,1,1} so after prefix sum, output is {0,1,1,1,2,3,3,3,3,3,3,4,5,5,6,7}. Is there a way to determine the number of distinct elements in this prefixed sum array inside any of the 3 kernels in this sample? either counting 1s in input array (somehow in scanLargeArray kernel) or distinct elements in final array (somehow in blockAddition kernel)?

Lets assume that number of 1s and 0s are not known beforehand in cpu and these distinct elements are to be determined inside kernel as the array size is in millions.

Thank you.

0 Likes
1 Solution
himanshu_gautam
Grandmaster

>> Is there a way to determine the number of distinct elements in this prefixed sum array inside any of the 3 kernels in this sample?

Just check the last number in the array... Say it is N..

Then you have 0, 1, 2, until N.

Thats all it is.

Create another kernel with N+1 threads which will write the global_id() into the same location in another buffer.

So, that buffer would contain all the distinct elements in the prefixed-sum array.

So simple.... (or) Did I misread your question?

View solution in original post

0 Likes
5 Replies
himanshu_gautam
Grandmaster

>> Is there a way to determine the number of distinct elements in this prefixed sum array inside any of the 3 kernels in this sample?

Just check the last number in the array... Say it is N..

Then you have 0, 1, 2, until N.

Thats all it is.

Create another kernel with N+1 threads which will write the global_id() into the same location in another buffer.

So, that buffer would contain all the distinct elements in the prefixed-sum array.

So simple.... (or) Did I misread your question?

0 Likes

Yeah it was that simple, my bad. Thank you. I will try this in a while.

and this scanLargeArrays sample works only for array with elements equal power of 2. Do you have idea how to modify it to work for any length arrays? I was getting invalid_work_group_size_error so i changed localThreads in bscan to Null, then i wasn't getting correct output. any solutions?

And this sample creates millions of threads if the array size is in order of millions. Generating that many threads is not good enough for performance, right?

0 Likes

Hi,

>>Do you have idea how to modify it to work for any length arrays? I was getting invalid_work_group_size_error so i changed localThreads in bscan to Null, then i wasn't getting correct output. any solutions?

     Ya this will work for only power of 2, its just because to maintain same workgroupsize always. You can do this for any number of elements but only thing is you have to handle the tail separately thats all. Always check whether the number of elements is power of , if so follow as it is, if not just handle last workgroup separatly. Rest everything will be same.

>>And this sample creates millions of threads if the array size is in order of millions. Generating that many threads is not good enough for performance, right?

     Yes you are correct. Handle this in blocks.

0 Likes

Thank you for such an elaborate reply. The code you mentioned is for my 1st question, right?


himanshu.gautam wrote:



Hi,


>>Do you have idea how to modify it to work for any length arrays? I was getting invalid_work_group_size_error so i changed localThreads in bscan to Null, then i wasn't getting correct output. any solutions?


     Ya this will work for only power of 2, its just because to maintain same workgroupsize always. You can do this for any number of elements but only thing is you have to handle the tail separately thats all. Always check whether the number of elements is power of , if so follow as it is, if not just handle last workgroup separatly. Rest everything will be same.



>>And this sample creates millions of threads if the array size is in order of millions. Generating that many threads is not good enough for performance, right?


     Yes you are correct. Handle this in blocks.


I didn't quite get the quoted text above; say there are 513 elements and workgroupsize is 256. Now how this would be done as it is not power of 2 and is an odd number as well? I should run the sample code for 512 elements and then have to handle the last element? or since the input size is not power of 2 so i should handle the code separately? what do you mean by handling last workgroup separately?

and can you elaborate 'Handle this in blocks' further? this would require a lot of code change both kernel and host side?

0 Likes

Say there are "N" elements.

Say local workgroup size si 256.

No. of workgroups = ((N-1)/256 + 1)

global threads = (No. of workgroups) * 256

inside your kernel:

__kernel void mykernel(int *array, int N)

{

     if (get_global_id(0) < N) {

       array[get_global_id(0)] = get_global_id(0);

    }

}

Thats all,

Best,

Bruhaspati

0 Likes