10 Replies Latest reply on Feb 4, 2009 11:20 AM by gaurav.garg

    gather & scatter


      I need to write a function that uses both gather and scatter. Does Brook+ not support this? And how does CAL identify the thread domain?

      kernel void test(float4 a[], out float4 b[])


        • gather & scatter

          That is a good question; I wanted to ask it as well.

          The only simple example I have found so far that both reads from and writes to the global buffer is importspeed in CAL. Its IL code reads from the global buffer first, does some computation, and finally writes back to the global buffer. However, it is not much of a test, since the sample program never reads the output back to verify it.

          • gather & scatter
            If you are doing scatter/gather, use only one global buffer.
            As for examples that use the global buffer:
            all of the compute samples do.
              • gather & scatter

                Can a kernel gather from streams of different sizes/dimensionality? I'm having issues with the following:


                kernel void stringMatch(int textStream[], int nextPosition, int hashTableStream[][], int hashIndex, out int2 resultStream<>) {
                    int i;
                    int idx;
                    int index;
                    int x, y;
                    idx = instance().x;
                    index = hashTableStream[hashIndex][idx];
                    i = 0;
                    if (index < nextPosition) {
                        // count how many consecutive elements match
                        while (textStream[nextPosition+i]==textStream[index+i]) {
                            i = i+1;
                        }
                    }
                    resultStream.x = i;
                    resultStream.y = index;
                }


                I just get zeros as the output.
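A CPU reference can help tell whether all-zero output is a plausible result or a sign that the kernel never ran. The following is a minimal sketch in Python, not Brook+ code: the function names are hypothetical, it assumes the output stream has one element per entry in the selected hash table row, and it adds an end-of-text guard that the posted kernel does not have.

```python
def match_length(text, index, next_position):
    """Count consecutive equal elements starting at index and next_position."""
    i = 0
    if index < next_position:
        # Guard against running off the end of the text (an added assumption;
        # the kernel as posted has no such check).
        while (next_position + i < len(text)
               and text[next_position + i] == text[index + i]):
            i += 1
    return i


def string_match_reference(text, next_position, hash_table, hash_index):
    """Return one (match length, candidate index) pair per entry in the row."""
    return [(match_length(text, index, next_position), index)
            for index in hash_table[hash_index]]
```

If this reference returns non-zero match lengths for the same inputs while the GPU output is all zeros, the problem is more likely in stream setup than in the matching logic.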

                  • gather & scatter

                    Did you check for errors on your output stream?

                      • gather & scatter

                        OK, I was barking up the wrong tree. It doesn't like one of my input streams, which is dimensioned [60000][500]: "Dimension not supported on the underlying hardware."

                        Is this a limitation of the HD2400 I'm still stuck with (4870 arriving imminently) or a more general limitation?


                          • gather & scatter

                            The maximum supported 2D stream dimensions are 8192x8192, and the maximum supported 1D stream length is 2^26.

                            You can either rearrange the data to fit these dimensions or change the algorithm to process the data tile by tile on the GPU (take a look at the out-of-core MMM sample in samples/CPP/apps). The 4870 has the same limitation.
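Both suggestions can be sketched on the host side for the [60000][500] case. This is a hedged illustration in Python rather than Brook+ code; the constants come from the limits quoted above, and the helper names are hypothetical. It shows that the table is too tall for a 2D stream, that it would fit in a 1D stream if flattened row by row, and how the rows could be split into tiles.

```python
MAX_2D_DIM = 8192      # per-dimension 2D limit quoted in the thread
MAX_1D_LEN = 2 ** 26   # 1D limit quoted in the thread


def fits_2d(rows, cols):
    """True if a rows x cols stream stays within the 2D limit."""
    return rows <= MAX_2D_DIM and cols <= MAX_2D_DIM


def fits_1d(length):
    """True if a 1D stream of this length stays within the 1D limit."""
    return length <= MAX_1D_LEN


def flat_index(row, col, cols):
    """Row-major index of element [row][col] in a flattened 1D stream."""
    return row * cols + col


def row_tiles(rows, tile_rows=MAX_2D_DIM):
    """Split the rows into [start, end) ranges of at most tile_rows each."""
    return [(start, min(start + tile_rows, rows))
            for start in range(0, rows, tile_rows)]
```

With flattening, the kernel would gather from a 1D stream using `row * 500 + col` instead of `[row][col]`. With tiling, each of the eight row ranges would be processed in a separate kernel call.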

                              • gather & scatter

                                What's best practice in this situation? The gather routine will try to access elements that don't exist at the extremes of the domain.

                                (i) Is there a way to access the size of the stream from within the kernel and limit access with if statements to prevent accessing out of bounds?

                                (ii) Should the domain of the kernel be limited and the extremities handled separately?

                                (iii) or does the compiler deal with it so there isn't a problem?


                                kernel void gather(int a[], out int b<>) {

                                    int idx = instance().x;

                                    b = a[idx-1] + a[idx] + a[idx+1];

                                }




                                  • gather & scatter
                                    (iii) For starters, I'm not sure that your kernel, the way it is now, will produce correct results. For example, given an array a of size 10 where each value equals its array index + 1, I get the output:


                                    So, at the outer limits, it's adding the element's own value in place of [idx-1] and [idx+1] respectively. I'm assuming this is how the compiler handles "out of bounds" accesses instead of giving an error. It seems odd to handle it that way; I would be interested to know why it is done like this.

                                    (i) You could add if statements, but that would probably decrease your performance. It might be faster to pad the array with an extra element at the lower and upper ends, set those to zero (or whatever value you need), run over the new domain, and then ignore the padded elements.

                                    (ii) I'm not sure limiting the domain will help because the index in the domain for the kernel will still be out of bounds and you will end up with the same problem, if I'm understanding things correctly.
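The padding idea from (i) can be sketched on the CPU as follows. This is a Python illustration under stated assumptions, not Brook+ code: `gather3_padded` and `pad_value` are hypothetical names, and the list comprehension stands in for running the kernel only over the interior of the padded array.

```python
def gather3_padded(a, pad_value=0):
    """Three-point sum where missing neighbours are replaced by pad_value."""
    # One extra element on each side, so idx-1 and idx+1 always exist.
    padded = [pad_value] + list(a) + [pad_value]
    # Run only over the interior, which corresponds to the original elements.
    return [padded[i - 1] + padded[i] + padded[i + 1]
            for i in range(1, len(padded) - 1)]
```

For the size-10 example with values 1 to 10 this gives 3 at the first element and 19 at the last, since the missing neighbours contribute zero.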

                                    • gather & scatter

                                      This sounds typical of GPU hardware behaviour. The rule of thumb, as far as I know, is that if an index exceeds the maximum (respectively minimum) limit, it is clamped to the maximum (respectively minimum) value.

                                      So you don't have to test the boundaries, but you need to keep in mind that

                                      v[max_limit+whatever_positive] will be treated as v[max_limit]
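A CPU emulation of that clamping rule makes the edge behaviour concrete. This is a minimal Python sketch that assumes indices are clamped to the first and last valid element as described above; the function names are hypothetical, and it is not a statement of what any particular GPU guarantees.

```python
def clamp(i, lo, hi):
    """Clamp index i into the inclusive range [lo, hi]."""
    return max(lo, min(i, hi))


def gather3_clamped(a):
    """Three-point sum with out-of-range neighbour indices clamped."""
    last = len(a) - 1
    return [a[clamp(i - 1, 0, last)] + a[i] + a[clamp(i + 1, 0, last)]
            for i in range(len(a))]
```

Under this assumption, an input of 1 to 10 gives 4 at the first element (1 + 1 + 2) and 29 at the last (9 + 10 + 10), which matches the description of edge elements adding their own value in place of the missing neighbour.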


                          • gather & scatter


                            Originally posted by: tgm@ncic.ac.cn I need to write a function that uses both gather and scatter. Does Brook+ not support this? And how does CAL identify the thread domain?

                            kernel void test(float4 a[], out float4 b[])

                            Brook+ supports this. Go to CPP\tutorials\ScatterStreamKernel and change the input stream to a gather stream; it should work without any problem.
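For readers unsure what a kernel with both a gather input and a scatter output computes, the semantics can be emulated on the CPU. This is a Python sketch under assumptions, not the tutorial's code: the index lists `read_index` and `write_index` are hypothetical stand-ins for whatever addresses each thread computes, and the loop runs threads in order, whereas on a GPU the result of two threads writing to the same address is generally not something to rely on.

```python
def gather_scatter(a, read_index, write_index, out_size, fill=0.0):
    """Emulate one thread per entry: indexed read from a, indexed write to b."""
    b = [fill] * out_size
    for tid in range(len(read_index)):   # one iteration per thread in the domain
        value = a[read_index[tid]]       # gather: read from an arbitrary address
        b[write_index[tid]] = value      # scatter: write to an arbitrary address
    return b
```

For example, `gather_scatter([10, 20, 30, 40], [3, 2, 1, 0], [0, 2, 1, 3], 4)` reads the input in reverse and writes each value to a permuted position.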