Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- AMD Community
- Communities
- Developers
- Devgurus Archives
- Archives Discussions
- ScanLargeArrays Sample in AMD SDK

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

10-21-2013
11:48 PM

ScanLargeArrays Sample in AMD SDK

Hi all,

I am computing prefix sum using ScanLargeArrays sample in SDK of a large array containing 1s and 0s lets say {1,0,0,1,1,0,0,0,0,0,1,1,0,1,1,1} so after prefix sum, output is {0,1,1,1,2,3,3,3,3,3,3,4,5,5,6,7}. Is there a way to determine the number of distinct elements in this prefixed sum array inside any of the 3 kernels in this sample? either counting 1s in input array (somehow in scanLargeArray kernel) or distinct elements in final array (somehow in blockAddition kernel)?

Lets assume that number of 1s and 0s are not known beforehand in cpu and these distinct elements are to be determined inside kernel as the array size is in millions.

Thank you.

Solved! Go to Solution.

1 Solution

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

10-22-2013
01:22 AM

>> Is there a way to determine the number of distinct elements in this prefixed sum array inside any of the 3 kernels in this sample?

Just check the last number in the array... Say it is N..

Then you have 0, 1, 2, until N.

Thats all it is.

Create another kernel with N+1 threads which will write the global_id() into the same location in another buffer.

So, that buffer would contain all the distinct elements in the prefixed-sum array.

So simple.... (or) Did I misread your question?

5 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

10-22-2013
01:22 AM

>> Is there a way to determine the number of distinct elements in this prefixed sum array inside any of the 3 kernels in this sample?

Just check the last number in the array... Say it is N..

Then you have 0, 1, 2, until N.

Thats all it is.

Create another kernel with N+1 threads which will write the global_id() into the same location in another buffer.

So, that buffer would contain all the distinct elements in the prefixed-sum array.

So simple.... (or) Did I misread your question?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

10-22-2013
01:41 PM

Yeah it was that simple, my bad. Thank you. I will try this in a while.

and this scanLargeArrays sample works only for array with elements equal power of 2. Do you have idea how to modify it to work for any length arrays? I was getting invalid_work_group_size_error so i changed localThreads in bscan to Null, then i wasn't getting correct output. any solutions?

And this sample creates millions of threads if the array size is in order of millions. Generating that many threads is not good enough for performance, right?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

10-22-2013
11:02 PM

Hi,

>>Do you have idea how to modify it to work for any length arrays? I was getting invalid_work_group_size_error so i changed localThreads in bscan to Null, then i wasn't getting correct output. any solutions?

Ya this will work for only power of 2, its just because to maintain same workgroupsize always. You can do this for any number of elements but only thing is you have to handle the tail separately thats all. Always check whether the number of elements is power of , if so follow as it is, if not just handle last workgroup separatly. Rest everything will be same.

>>And this sample creates millions of threads if the array size is in order of millions. Generating that many threads is not good enough for performance, right?

Yes you are correct. Handle this in blocks.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

10-23-2013
12:47 AM

Thank you for such an elaborate reply. The code you mentioned is for my 1st question, right?

himanshu.gautam wrote:

Hi,

>>Do you have idea how to modify it to work for any length arrays? I was getting invalid_work_group_size_error so i changed localThreads in bscan to Null, then i wasn't getting correct output. any solutions?

Ya this will work for only power of 2, its just because to maintain same workgroupsize always. You can do this for any number of elements but only thing is you have to handle the tail separately thats all. Always check whether the number of elements is power of , if so follow as it is, if not just handle last workgroup separatly. Rest everything will be same.

>>And this sample creates millions of threads if the array size is in order of millions. Generating that many threads is not good enough for performance, right?

Yes you are correct. Handle this in blocks.

I didn't quite get the quoted text above; say there are 513 elements and workgroupsize is 256. Now how this would be done as it is not power of 2 and is an odd number as well? I should run the sample code for 512 elements and then have to handle the last element? or since the input size is not power of 2 so i should handle the code separately? what do you mean by handling last workgroup separately?

and can you elaborate 'Handle this in blocks' further? this would require a lot of code change both kernel and host side?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

10-22-2013
11:37 PM

Say there are "N" elements.

Say local workgroup size si 256.

No. of workgroups = ((N-1)/256 + 1)

global threads = (No. of workgroups) * 256

inside your kernel:

__kernel void mykernel(int *array, int N)

{

if (get_global_id(0) < N) {

array[get_global_id(0)] = get_global_id(0);

}

}

Thats all,

Best,

Bruhaspati