First of all, thanks a lot for your advice; reading this forum helps me a lot. But now I am facing a problem and I am not sure what the best approach might be:
I have a kernel that solves a linear system of equations. The system is underdetermined, so the kernel generates many solutions. The vast majority of them are bad and a small subset are good.
So I have a second kernel that receives all the solutions from the first kernel and checks whether each solution is good or bad, adding a mark indicating which. So the final output is a buffer filled with all the solutions, plus the good/bad mark for each one.
This scheme generates large amounts of memory transfer from GPU to Host, so I am looking for a way to transfer only the good solutions.
My first thought is to generate a third kernel that copies only the good solutions from the output of the second kernel to an output buffer and puts the total number of good solutions in the first position of that buffer. On the host I will then do two memory transfers: one to read the total number of solutions, and a second to transfer the output buffer with only the good solutions.
I am not sure whether there is a better approach to this problem, so any insight will be much appreciated.
I have the same problem, but I am using Brook+ SDK 1.4.0.
So I thought OpenCL would have a better solution.
But looking at your post tells me that I would get stuck even if I upgraded to OpenCL.
Let me know if you found something about this problem.
Originally posted by: MicahVillmow
niravshah00/afo, doing variable-sized output is actually quite simple if you have an HD5XXX series card.
1) Allocate an output buffer large enough to hold the maximum amount of data you could possibly need to write.
2) Have a global pointer to a single integer and initialize it to zero.
3) Execute your kernel and determine whether the results should be written out or not.
3a) If the results are good, call atom_inc on the global int pointer; the return value is the index where you write your results.
3b) If the results are bad, don't write them out.
4) Read the memory object that contains the global int pointer to find out how many values are good.
5) Allocate a memory buffer of exactly the size required to hold the 'good' values.
6) Execute a kernel that copies the data from your 'large' buffer to the exactly sized buffer created in step 5.
Now, if the int returns a value almost equal to the max size, you can skip steps 5 and 6 and just copy the whole buffer back, as the overhead of creating a new buffer and executing a new kernel might not be worth it. However, if you have, say, a 100MB buffer allocated in step 1 but only need to copy back 10KB of data, steps 5/6 might offer a significant performance improvement. You can also improve step 5 by preallocating some buffers and then selecting one based on the size of your valid output data. The only drawback of this approach is that your initial output buffer must always be the worst-case size.
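Micah's numbered steps can be sketched as a serial simulation in Python. The names here are hypothetical: the loop stands in for the parallel work-items, and the plain integer stands in for the global atomic counter that atom_inc would update on the GPU.

```python
def compact_with_counter(solutions, is_good, max_size):
    # Step 1: worst-case output buffer.
    out = [None] * max_size
    # Step 2: global int counter, initialised to zero.
    counter = 0
    for s in solutions:            # each iteration models one work-item
        if is_good(s):             # step 3: keep or discard this result
            idx = counter          # step 3a: atom_inc returns the old value,
            counter += 1           #          which is this item's write index
            out[idx] = s
        # step 3b: bad results are simply not written
    # Step 4: 'counter' now holds the number of good solutions;
    # steps 5-6 copy out[:counter] into an exactly sized buffer.
    return out[:counter], counter

packed, n = compact_with_counter(range(10), lambda s: s % 3 == 0, 10)
# packed == [0, 3, 6, 9], n == 4
```

Note that because the serial loop visits items in order, this simulation preserves order; the real atomic version does not guarantee any particular ordering of the packed results.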
Is it possible on FireStream 9270 cards?
Micah and developers:
I think that a variable-output kernel example would be useful in the SDK examples, but it may be less important than other examples; so perhaps it would be possible to create a "contributed examples" repository (as other projects do), where one could upload examples showing less-used features (probably with AMD's approval). That way there would be many more examples available to learn from, and it would not take up the developers' time to maintain/test them.
The more standard way to do this in parallel programming would be to do a parallel prefix sum over the valid outputs and then run a kernel that compacts based on the prefix-sum values. That would work without atomic support, and it might even be faster than the atomic version, because the atomic version needs the pre-op and would do a final unordered scatter stage. Better still, it maintains order if that is important to you: you get a stable compaction/binary sort.
Or you could do it in DirectCompute and use the append buffer functionality.
Originally posted by: LeeHowes The more standard way to do this in parallel programming would be to do a parallel prefix sum over the valid outputs and then run a kernel that compacts based on the prefix-sum values. That would work without atomic support, and it might even be faster than the atomic version, because the atomic version needs the pre-op and would do a final unordered scatter stage. Better still, it maintains order if that is important to you: you get a stable compaction/binary sort.
Or you could do it in DirectCompute and use the append buffer functionality.
Can you elaborate a little more on this? I was about to move from Brook+ to OpenCL just because my performance in Brook+ is pathetic: I have to filter a stream of size 8192 x 8192 for the results, and I call the kernel a number of times.
But now it seems I cannot use this in OpenCL either, since the 9270 does not support atomic operations, so there is no point in moving to OpenCL.
So in the first pass you've output multiple values, some of which are valid and some aren't, and there's some way to test for that, right? So let's assume we have an array of 1s and 0s where the 1s are the valid outputs. What you want to know for a compaction pass is how many 1s are present before your location, so you can do a kernel that does:
__kernel void compact( __global const int *prefixSum, ... )
So it knows where to output because it knows how many valid entries there were before it in the array. You can obtain this data by doing a prefix sum, in serial form that would be:
int sum = 0;
for( int i = 0; i < n; ++i )
{
    prefixSum[i] = sum;
    if( valid( things[i] ) )
        sum++;
}
Of course, for the GPU you want to do this prefix sum in parallel. You can do that using a reduction tree where each workgroup on the GPU takes a block and does a prefix sum internally to work out how many entries were present, you can do that by using neighbour-wise addition with a gradually increasing stride (there are efficient ways to write the code). Each workgroup outputs the final sum from its block of data, and the set of internal sums.
You then need to do a sum of those sums, and do this hierarchically until you have a final result.
You now have a list of block sums which you can add back to the original set of blocks to compute the global prefix sum values. This final stage could be trivially merged with your compaction stage to reduce the number of kernel launches.
This approach isn't as work-efficient as the atomic approach, but it's more widely usable, ordered, and pretty efficient overall. It's at the heart of the radix sort, among other things; indeed, a compaction like this can be seen as a 1-bit radix sort.
Does that make any sense? There's a very basic and not very efficient radix sort in the current stream SDK that should give you a clue. The one I wrote some time back will hopefully be put in there at some point once it's tidied up.
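Lee's scheme can be sketched serially in Python; on the GPU the scan would of course run as a parallel tree, and the function names here are hypothetical:

```python
def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] = number of 1s strictly before i."""
    out, running = [], 0
    for f in flags:
        out.append(running)
        running += f
    return out

def compact(values, flags):
    """Scatter each valid value to its scan position (stable: order kept)."""
    positions = exclusive_scan(flags)
    total = positions[-1] + flags[-1] if flags else 0
    packed = [None] * total
    for v, f, p in zip(values, flags, positions):
        if f:                   # only valid work-items write anything
            packed[p] = v
    return packed

print(compact(['a', 'b', 'c', 'd'], [0, 1, 1, 0]))   # ['b', 'c']
```

Because the write position comes from the exclusive scan, items keep their original relative order, which is exactly the stable compaction mentioned above.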
Thanks for the explanation
I have the exact same thing. I mean, I have a 3D stream, and in the kernel I write a one to each location where there is output, or else a zero (I'm using Brook+ for my project). The locations in the 3D stream are my actual solutions; that is, instance().x, instance().y, instance().z are the solution.
Now I understand the reduction kernel that you suggest here, but I did not understand how you use the prefix sum to get the location in the output array, i.e. the location where the solution is to be written.
If I do the prefix sum, wouldn't that be the same as going through the entire stream, which is the thing I want to avoid? This is a bottleneck in my performance.
I hope you understand what my problem is here.
It would mean going over the data in the stream, yes. It can be done in parallel, though, so you don't have a serial check as a bottleneck. That may still be a problem for you, but if that approach doesn't work you may have to think again about how you generate the data in the first place.
The prefix sum lets you go from data like this:
0 0 1 1 0 1 0 1 1 0 1
To knowing how many 1s there were:
0 0 0 1 2 2 3 3 4 5 5... and 6 in total
To then being able to write those out:
n n 0 1 n 2 n 3 4 n 5
So the WIs with 1s know what address to write to, and the total sum tells you how much data you have in total: 6 entries.
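The worked example above can be checked with a few lines of Python; this is a serial stand-in for the parallel scan, not how it would run on the GPU:

```python
flags = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# Exclusive prefix sum: how many 1s appear before each position.
scan, total = [], 0
for f in flags:
    scan.append(total)
    total += f

print(scan)    # [0, 0, 0, 1, 2, 2, 3, 3, 4, 5, 5]
print(total)   # 6 entries in the compacted output

# Work-items with a 1 write to address scan[i]; the others write nothing.
addresses = [scan[i] if f else None for i, f in enumerate(flags)]
print(addresses)  # [None, None, 0, 1, None, 2, None, 3, 4, None, 5]
```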
Could you give me some example code?
All I have is a 3D stream with 1's and 0's.
Now I want to capture the locations in the stream, the (x, y, z) indices where there are ones.
I'll assume you can look up parallel prefix sum code, it'd take me too long to type out.
But let's say you have a 3D array of 1s and 0s saying "valid" or "invalid", you also store with each object you output the location in the array.
So then you go through the 3D array doing a prefix sum; in this case we'll do it trivially as a serial loop, and it's best to do it in 1D because you're going to pack into a 1D array anyway. Bear in mind that you can treat the 3D array as a 1D array for the prefix sum, so just look up parallel 1D prefix sums; as long as you know the 3D address you can pretend it's not 3D.
sum = 0
for x, y, z (as a 1D loop, do it row major or whatever you like as long as you're consistent):
    if( valid(x, y, z) ):
        prefixSum(x, y, z) = sum
        sum = sum + 1
So you have a 3D grid of prefix sums saying where, if compacted into a 1D array, you will write the data. You then, in parallel over your data do something like:
parallel_for(x, y, z):
    if( valid(x, y, z) ):
        packed_array_to_copy(prefixSum(x, y, z)).address = int3(x, y, z)
        packed_array_to_copy(prefixSum(x, y, z)).data = the valid object
You copy packed_array_to_copy to the CPU. You now can iterate over it:
for( i in packed_array_to_copy.size() ):
    CPU_side_3D_array(packed_array_to_copy(i).address) = packed_array_to_copy(i).data
Which of course means going over the data yet again, but only the valid data this time. Whether it's beneficial or not depends on how few valid items there are compared with the invalid ones, how fast the data transfer is, and so on.
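Putting Lee's pseudocode together, the whole 3D round trip might look like this in Python; the GPU stages are written as plain serial loops, and the function names are hypothetical:

```python
def pack_3d(valid, data, shape):
    """Serial stand-in for the GPU stages: walk the 3D grid in a fixed
    (row-major) 1D order and pack (address, value) pairs for the valid
    entries. 'pos' plays the role of prefixSum(x, y, z)."""
    nx, ny, nz = shape
    packed, pos = [], 0
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                if valid[x][y][z]:
                    packed.append(((x, y, z), data[x][y][z]))
                    pos += 1        # the next valid item goes one slot later
    return packed                   # only this compacted list reaches the CPU

def unpack_on_cpu(packed, shape):
    """CPU side: scatter only the valid items back into a 3D array."""
    nx, ny, nz = shape
    out = [[[None] * nz for _ in range(ny)] for _ in range(nx)]
    for (x, y, z), value in packed:
        out[x][y][z] = value
    return out
```

The point of the exercise is that `packed` is much smaller than the full grid when only a few entries are valid, so the GPU-to-host transfer shrinks accordingly.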
Any clearer? I may be completely misunderstanding your problem of course.
I get it now
Step 1: get the stream with 1's and 0's from the original kernel (my kernel) onto the CPU in an array (which would of course be 1D).
Step 2: send this array to calculate the prefix sum (another kernel can do this in parallel, right?).
Step 3: call a reduction kernel with the original stream and the prefix sum; the prefix sum tells me which place to write each valid result to.
Step 4: write this array to the host again and print the result.
I hope my algorithm is right. Let me know, so that I can work on this over the weekend and get back to you if it improves my performance. I am hoping it will improve the performance, if not on the CPU then at least on the GPU.
Thanks a lot, Lee.
No no, don't copy it back. Just write a parallel prefix sum on the GPU to do the summation; I only showed you a serial one as an example to make the point. You can generate that data quite efficiently in log time on the GPU. You really want to do as little copying back as possible: it stalls the GPU and forces extra synchronisation, so you'll just lose performance.
Leave the data on the GPU. Do the prefix sum on the GPU. Use the prefix sum array as input to the compaction kernel. Just copy the compacted data back with the indices for where the data should sit in the array once copied back.
OK, so use the same output stream of my kernel and pass it as input to the prefix_sum kernel (not copying the stream to the host).
Now we have the prefix_sum array, which is used as input for the compaction kernel (again, no copying of the prefix_sum array to the host).
Now just copy the compacted array back to the host and display the result.
Just convert the addresses to 3D. The prefix sum is 1D. Your 3D dataset is just a 1D dataset that wraps in 3 dimensions.
1 2 3 4 5 6 7 8

in 2D is:

1 2 3 4
5 6 7 8

In 3D it's:

1 2
3 4

(then behind the 1 is the 5, behind the 2 is the 6, and so on...)
So store the 3D addresses instead of the 1D addresses with the valid data items, and scatter them into 3D instead of 1D.
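The address conversion is just integer division and modulo. A small sketch, assuming row-major layout with x fastest-varying, matching the 1 2 3 4 5 6 7 8 example above (function names are made up):

```python
def to_3d(i, nx, ny):
    """Convert a flat (1D) index into (x, y, z), x fastest-varying."""
    x = i % nx
    y = (i // nx) % ny
    z = i // (nx * ny)
    return x, y, z

def to_1d(x, y, z, nx, ny):
    """Inverse mapping: flatten (x, y, z) back to a 1D index."""
    return x + nx * (y + ny * z)

# For a 2 x 2 x 2 grid holding 1..8, element 5 (flat index 4)
# sits at (0, 0, 1): directly "behind" the 1, as in the example.
print(to_3d(4, 2, 2))   # (0, 0, 1)
```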
Sorry Lee,
but I don't think I understand the 2D prefix sum given in the samples.
I don't understand the logic behind it, because the wiki link for the prefix sum algorithm is somewhat different from what is implemented in the samples.
I didn't know there was a 2D prefix sum in the samples. There's a 1D one as well, isn't there? If not, you could look at the one from NVIDIA's SDK, which is reasonably efficient; chapter 39 of GPU Gems 3 describes an implementation as well. Note that that one is suboptimal, because it does too much control-flow checking to be fully efficient.
Forget 2D completely, as I said flatten your data into 1D for the prefix sum. It's a 1D array in memory after all. Multi-dimensional arrays are overrated (unless you're actually doing a 2D access pattern that's likely to get a performance benefit from the hardware).
What you want to do is divide the data into blocks, each processed by a workgroup. Let's say each workgroup is going to process 64 elements and your workgroups are 64 work-items in size. Allocate 64 elements of local memory per group and initialise them to the 64 values of the block you want to prefix sum.
Then you do (note the loop in the algorithm on the wikipedia page):
for( i = 1; i < 64; i *= 2 ) {
    int tmp = ( get_local_id(0) >= i ) ? localData[get_local_id(0)-i] : 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    localData[get_local_id(0)] += tmp;
    barrier(CLK_LOCAL_MEM_FENCE);
}
Now you output localData to a buffer, offset by 1, and you also output the total sum to another buffer which contains the sums for each block.
int output;
if( get_local_id(0) == 0 )
    output = 0;
else
    output = localData[get_local_id(0)-1];
outputdata[get_global_id(0)] = output;
if( get_local_id(0) == 63 )
    blockSums[get_group_id(0)] = localData[63];
Now you do another pass that adds in the block sums; clearly you want to do a prefix sum over those as well, and you may need to do that as a tree. You then have another kernel pass that loads the prefix-summed block sum and adds it to each of the local data items in the block.
You now have a prefix sum of the entire dataset.
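The blocked scheme can be simulated serially in Python to see how the passes fit together; each inner loop stands in for one workgroup, and the names are hypothetical:

```python
def blocked_exclusive_scan(data, block=4):
    """Simulate the GPU passes: per-block exclusive scans, a scan of
    the per-block totals, then adding each block's offset back in."""
    # Pass 1: each 'workgroup' scans its own block and records its total.
    scans, block_sums = [], []
    for start in range(0, len(data), block):
        chunk, running = data[start:start + block], 0
        for v in chunk:
            scans.append(running)
            running += v
        block_sums.append(running)
    # Pass 2: exclusive scan over the block totals (a tree on the GPU).
    offsets, running = [], 0
    for s in block_sums:
        offsets.append(running)
        running += s
    # Pass 3: add each block's offset to its local scan values.
    return [scans[i] + offsets[i // block] for i in range(len(data))]

flags = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(blocked_exclusive_scan(flags))
# [0, 0, 0, 1, 2, 2, 3, 3, 4, 5, 5]
```

The result matches the earlier worked example, which is a quick way to sanity-check a parallel implementation against the serial one.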
OK, I was talking about the Brook+ SDK samples and you are talking about the OpenCL ones.
I have to do this in Brook+.
I saw the OpenCL sample and it is a little simpler to understand, though I don't fully understand what is done there.
I am trying to replicate the same thing in Brook+.
What do I do for barrier(CLK_LOCAL_MEM_FENCE); in Brook+?
Oh, since you said you were wondering if OpenCL would be better I assumed you were now asking about that.
I know very little about Brook+ and haven't used it since back in the days when it was a Stanford research project.