25 Replies Latest reply on Jun 30, 2010 4:03 PM by niravshah00

    Variable output kernel?

    afo
      What could be the best approach to a variable output size from a kernel?

      Hi all,

      First of all, thanks a lot for your advices, reading this forum helps me a lot. But now I am facing a problem and I am not sure which could be the best approach:

      I have a kernel that solves a linear system of equations. the system is subdetermined, so the kernel generates many solutions. The vast majority of them are bad and a little subset are good.

      So I have a second kernel that receives all the solutions from the first kernel and checks if a solution is good or bad, adding a mark indicating good or bad. So the final output is a buffer filled with all the solutions plus the good/bad mark for everyone.

      This scheme generates large amounts of memory transfer from GPU to Host, so I am looking for a way to transfer only the good solutions.

      My first shoot is to generate a third kernel that copies only the good solutions from the output of the second kernel to an output buffer and puts the total number of good solutions in the first place of the output buffer. And in the host I will generate 2 memory transfers, one to read the total number of solutions, and the second one to transfer the output buffer with only the good solutions.

      I am not sure if it is a better approach to deal with this problem, so any insight about this will be very appreciated.

      best regards,

      Alfonso

        • Variable output kernel?
          niravshah00

          Hi ,

          I have the same problem but i am using Brook+ SDK 1.4.0

          So i thought OpenCL would have a better solution.

          But looking at your post tell me that i would get stuck even if i upgraded to OpenCL.

          Let me know if you found something about this problem.

           

          • Variable output kernel?
            MicahVillmow
            niravshah00/afo,
            Doing variable sized output is actually quite simple if you have a HD5XXX series card.
            1) Allocate of Type n an output buffer holding the maximum amount of memory that you will need to write to.
            2) Have a global pointer to a single integer and initialize it to zero.
            3) Execute your kernel and determine if the results should be written out or not
            3a) If the results are good, call atom_inc on the global int pointer, the return values is the index where you write your results
            3b) If the results are bad, don't write out your results
            4) Read the memory object that contains the global int pointer to find out how many values are good
            5) Allocate a memory buffer equal to the exact amount of memory required to copy the 'good' values
            6) Execute a kernel that copies data from your 'large' buffer to the exactly size buffer created in step 5.

            Now, if the int returns a value where it is almost equal to the max size, you can skip steps 5 and 6 and just copy the whole buffer back as the over head of creating a new buffer and executing a new kernel might not be worth it. However, if you have say a 100MB buffer allocated in step 1, but you only need to copy back 10KB of data, steps 5/6 might offer significant performance improvement.

            Also, you can improve step 5 by preallocating some buffers and then just select the buffer you want based on the size of your valid output data. The only drawback with this approach is that your initial output buffer must be the worst-case size all the time.
              • Variable output kernel?
                niravshah00

                How to implement this in Brook+?

                Or is there any workaround for this in Brook+

                • Variable output kernel?
                  niravshah00

                   

                  Originally posted by: MicahVillmow niravshah00/afo, Doing variable sized output is actually quite simple if you have a HD5XXX series card. 1) Allocate of Type n an output buffer holding the maximum amount of memory that you will need to write to. 2) Have a global pointer to a single integer and initialize it to zero. 3) Execute your kernel and determine if the results should be written out or not 3a) If the results are good, call atom_inc on the global int pointer, the return values is the index where you write your results 3b) If the results are bad, don't write out your results 4) Read the memory object that contains the global int pointer to find out how many values are good 5) Allocate a memory buffer equal to the exact amount of memory required to copy the 'good' values 6) Execute a kernel that copies data from your 'large' buffer to the exactly size buffer created in step 5. Now, if the int returns a value where it is almost equal to the max size, you can skip steps 5 and 6 and just copy the whole buffer back as the over head of creating a new buffer and executing a new kernel might not be worth it. However, if you have say a 100MB buffer allocated in step 1, but you only need to copy back 10KB of data, steps 5/6 might offer significant performance improvement. Also, you can improve step 5 by preallocating some buffers and then just select the buffer you want based on the size of your valid output data. The only drawback with this approach is that your initial output buffer must be the worst-case size all the time.


                  Is it possible on FireStream 9270 cards?

                  • Variable output kernel?
                    afo

                    Micah and developers:

                    I think that a variable output kernel example would be useful in the SDK examples, but it could be less important than other examples; so, perhaps it could be possible to create a "contributed examples" repository (as other projects do), where one could upload an example showing features less used (probably with AMD's approval). In this way, there will be a lot more examples available to learn and it will not load developer's time to maintain/test them.

                    best regards,

                    Alfonso

                      • Variable output kernel?
                        LeeHowes

                        The more standard way to do this in parallel programming would be to do a parallel prefix sum over the valid outputs and then run a kernel that compacts based on the prefix sum values. That would work without atomic support, it might also be faster than the atomic version because the atomic version needs the pre-op and would do a final unordered scatter stage. Better, still, it maintains order if this is important to you - you get a stable compaction/binary sort.

                        Or to do it in directcompute and use the append buffer functionality

                          • Variable output kernel?
                            niravshah00

                             

                            Originally posted by: LeeHowes The more standard way to do this in parallel programming would be to do a parallel prefix sum over the valid outputs and then run a kernel that compacts based on the prefix sum values. That would work without atomic support, it might also be faster than the atomic version because the atomic version needs the pre-op and would do a final unordered scatter stage. Better, still, it maintains order if this is important to you - you get a stable compaction/binary sort.

                             

                            Or to do it in directcompute and use the append buffer functionality

                             

                            Can elaborate a little more on this becuase i was about to move from Brook+ to OpenCL just because my performance on Brook+ is pathetic because i have to filter a stream of 8192 x 8192 size for the results and i call the kernel number of time .

                            But now cannot use OpenCL as well as 9270 does not atomic operation so no point going for OpenCL as well.

                              • Variable output kernel?
                                LeeHowes

                                So in the first pass you've output multiple values, some of which are valid and some aren't, and there's some way to test for that, right? So let's assume we have an array of 1s and 0s where the 1s are the valid outputs. What you want to know for a compaction pass is how many 1s are present before your location, so you can do a kernel that does:

                                 

                                __kernel( int prefixSum[n], ThingToOutput things[n], ThingToOutput compactedDestination[someSmallerN] )

                                {

                                  if( valid(things[n]) )

                                    compactedDestination[prefixSum[n]] = things[n];

                                }

                                So it knows where to output because it knows how many valid entries there were before it in the array. You can obtain this data by doing a prefix sum, in serial form that would be:

                                int sum = 0;

                                for( int i = 0 to n )

                                {

                                  prefixSum[n] = sum;

                                  if( valid( things[n] ) )

                                    sum++;

                                }

                                 

                                Of course, for the GPU you want to do this prefix sum in parallel. You can do that using a reduction tree where each workgroup on the GPU takes a block and does a prefix sum internally to work out how many entries were present, you can do that by using neighbour-wise addition with a gradually increasing stride (there are efficient ways to write the code). Each workgroup outputs the final sum from its block of data, and the set of internal sums.

                                You then need to do a sum of those sums, and do this hierarchically until you have a final result.

                                You now have a list of block sums which you can add back to the original set of blocks to compute the global prefix sum values. This final stage could be trivially merged with your compaction stage to reduce the number of kernel launches.

                                This approach isn't as work-efficient as the atomic approach, but it's more widely useable, ordered, and pretty efficient overall. It's at the heart of the radix sort among other things. Indeed, a compaction like this can be seen as a 1-bit radix sort.

                                 

                                Does that make any sense? There's a very basic and not very efficient radix sort in the current stream SDK that should give you a clue. The one I wrote some time back will hopefully be put in there at some point once it's tidied up.

                                  • Variable output kernel?
                                    niravshah00

                                    Hi Lee,

                                    Thanks for the explanation

                                    I have the exact same thing i mean i have  3d stream and in the kernel i write one to location where there is output or else zero.(Using Brook+ for my project) The location in the 3d stream are my actual solution. i mean instance().x,instance().y,instance().z are solution.

                                    Now i understood the reduction kernel that you suggest here but i did not understand how you used prefix sum to get the location on the output array. The location where the solution is to be written.

                                    if I do the prefix sum wouldn't  that be the same as going through the entire stream which is  the thing i want to avoid.Since this a bottle neck in my performance.

                                    I hope you understand what is my problem here

                                      • Variable output kernel?
                                        LeeHowes

                                        It would mean going over the data in the stream, yes. It can be done in parallel, though, so you don't have a serial check as a bottleneck. That may still be a problem for you, but if that approach doesn't work you may have to think again about how you generate the data in the first place.

                                        The prefix sum lets you go from data like this:

                                        0 0 1 1 0 1 0 1 1 0 1

                                        To knowing how many 1s there were:

                                        0 0 0 1 2 2 3 3 4 5 5... and 6 in total

                                        To then being able to write those out:

                                        n n 0 1 n 2 n 3 4 n 5

                                        So the WIs with 1s know what address to write to, and the total sum tells you how much data yo uhave in total: 6 entries.

                                          • Variable output kernel?
                                            niravshah00

                                            Could you give me an example code

                                            All i have is a 3d stream with 1's and 0's .

                                            Now i want to capture the location in the stream (x,y,z) index of stream where there are ones.

                                             

                                              • Variable output kernel?
                                                LeeHowes

                                                I'll assume you can look up parallel prefix sum code, it'd take me too long to type out.

                                                But let's say you have a 3D array of 1s and 0s saying "valid" or "invalid", you also store with each object you output the location in the array.

                                                So then you go through the 3D array doing a prefix sum, in this case we'll do it trivialy as a serial loop and best to do it in 1D because you're going to pack into a 1D array anyway. Bear in mind that you can treat the 3D array as a 1D array for the prefix sum anyway so just look up parallel 1D prefix sums, as long as you know the 3D address you can pretend it's not 3D.

                                                sum = 0

                                                for x, y, z (as a 1D loop, do it row major or whatever you like as long as you're consistent):

                                                  if( valid(x, y, z) ):

                                                    prefixSum(x, y, z) = sum

                                                    sum = sum + 1

                                                 

                                                So you have a 3D grid of prefix sums saying where, if compacted into a 1D array, you will write the data. You then, in parallel over your data do something like:

                                                parallel_for(x, y, z):

                                                  packed_array_to_copy(prefixSum(x, y, z)).address = int3(x, y, z)

                                                  packed_array_to_copy(prefixSum(x, y, z)).data = the valid object

                                                 

                                                You copy packed_array_to_copy to the CPU. You now can iterate over it:

                                                for( i in packed_array_to_copy.size() ):

                                                  CPU_side_3D_array(packed_array_to_copy(i).address) = packed_array_to_copy(i).data

                                                 

                                                Which of course means going over the data yet again, but only the valid data this time. Whether it's beneficial or not depends on how few valid items there are compared with the invalid ones, how fast the data transfer is etc etc.

                                                Any clearer? I may be completely misunderstanding your problem of course.

                                                  • Variable output kernel?
                                                    niravshah00

                                                    Ok ,

                                                    I get it now

                                                    Step 1 : get the stream with 1's and 0's from the original kernel (my kernel) on to CPU in an array (which would be ofcourse 1D)

                                                    Step 2 : Send this array to caluculate prefix sum(another kernel  can do this in parallel Right???

                                                    Step 3: Call a reduction kernel with the original stream and prefix sum  now prefix sum would tell me which place to write the valid result

                                                    Step 4 : write this array to host again and print the result.

                                                    I hope my algorithm is right .Let me know so that i can work over this on the weekend and get back to you if this does improve my performance . I hoping that this would improve the performance if not on CPU atleast on GPU

                                                    Thanks a lot Lee .

                                                     

                                                      • Variable output kernel?
                                                        LeeHowes

                                                        Nono, don't copy it back. Just write a parallel prefix sum on the GPU to do the summation. I just showed you a serial one as an example to make the point. You can generate that data quite efficiently in log time on the GPU. You really want to do as little copying back as possible, it stalls the GPU and makes you have to do extra synchronisation. You'll just lose performance.

                                                        Leave the data on the GPU. Do the prefix sum on the GPU. Use the prefix sum array as input to the compaction kernel. Just copy the compacted data back with the indices for where the data should sit in the array once copied back.

                                                          • Variable output kernel?
                                                            niravshah00

                                                            ok so use the same  output  stream of my kernel  and pass i as input to prefix_sum kernel (not copying the stream on host)

                                                            Now we have the prefix_sum array which is used by as input for compaction kernel (again no copying of prefix_sum array to host)

                                                            Now just copy the compacted array back on host and display the result

                                                              • Variable output kernel?
                                                                niravshah00

                                                                Now the prefix sum giving in th examples is written for 2d array and i have a 3d stream .

                                                                How does that work ?

                                                                  • Variable output kernel?
                                                                    LeeHowes

                                                                    Just convert the addresses to 3D. The prefix sum is 1D. Your 3D dataset is just a 1D dataset that wraps in 3 dimensions.

                                                                     

                                                                    1 2 3 4 5 6 7 8

                                                                    in 2d is:

                                                                    1 2

                                                                    3 4

                                                                    5 6

                                                                    7 8

                                                                     

                                                                    In 3d it's:

                                                                    1 2

                                                                    3 4

                                                                    (then behind the 1 is the 5...)

                                                                    5 6

                                                                    7 8

                                                                     

                                                                    So store the 3D addresses intead of 3D addresses with the valid data items and scatter them into 3D instead of 2D.

                                                                      • Variable output kernel?
                                                                        niravshah00

                                                                        Sorry Lee ,

                                                                        But I don't think I understand the 2D prefix sum given in the smaples.

                                                                        Don't understand the logic behind it because the wikki link for prefix sum algorithm is somewhat different from what is implemented in the samples.

                                                                          • Variable output kernel?
                                                                            LeeHowes

                                                                            I didn't know there was a 2D prefix sum in the samples. There's a 1D one as well isn't there? If not you could look at the one from nvidia's SDK which is reasonably efficient and chapter 39 of GPU Gems 3 has an implementation described, as well. Note that in that case it's suboptimal because it does too much control checking for efficiency.

                                                                            Forget 2D completely, as I said flatten your data into 1D for the prefix sum. It's a 1D array in memory after all. Multi-dimensional arrays are overrated (unless you're actually doing a 2D access pattern that's likely to get a performance benefit from the hardware).

                                                                            What you want to do is divide the data into blocks each processed by a workgroup. Let's say each workgroup is going to process 64 elements and your workgroups are 64 in size. Allocate 64elements of local memory per group and initialise  to the 64 values of your block you want to prefix sum.

                                                                            Then you do (note the loop in the algorithm on the wikipedia page):

                                                                            for( i = 0; i < 64; i *= 2 )

                                                                            {

                                                                              if( get_local_id(0) > i )

                                                                                localData[get_local_id(0)] += localData[get_local_id(0)-i];

                                                                            }

                                                                            Now you output localData to a buffer, offset by 1, and you also output the total sum to another buffer which contains the sums for each block.

                                                                            So:

                                                                            if( get_local_id(0) == 0 )

                                                                              output = 0;

                                                                            else

                                                                              output = localData[get_local_id(0)-1];

                                                                            outputdata[get_global_id(0)] = output;

                                                                            if( get_local_id(0) == 0 )

                                                                              blockSums[get_group_id(0)] = localData[64];

                                                                             

                                                                            Now you do another pass that adds the block sums, clearly you want to do a prefix sum over those as well. You may need to do this as a tree. You then have another kernel pass that loads the prefix summed block id and adds that to each of the local data items in the block.

                                                                            You now have a prefix sum of the entire dataset.

                                              • Variable output kernel?
                                                MicahVillmow
                                                There is no way to do this in Brook+ as atomic operations are not supported.
                                                • Variable output kernel?
                                                  MicahVillmow
                                                  The firestream 9270 does not have atomic operations, so cannot use this method.