16 Replies Latest reply on Dec 21, 2009 3:57 AM by gaurav.garg

    brcc hangs - subkernel with stream operator?

    drstrip

      The following (simplified) code causes brcc to hang:

      kernel void computeEnergy(uint4 old_spin[][], uint4 new_spin<>, out uint4 energy<>, int ROWS, int COLS)
      {
        int4 index = instance();

        <mumble>
      }


      kernel void updateSpin(uint4 spin_in<>, uint4 seeds_in<>, out uint4 spin_out<>, out uint4 seeds_out<>, int num_steps, uint num_spin_states, int ROWS, int COLS)
      {
        uint4 proposed_energy;
        uint4 proposed_spin;
        computeEnergy(spin_in, proposed_spin, proposed_energy, ROWS, COLS);
      }

       

      Is the hang being caused by the subkernel containing a stream (gather) operation? The manual is a little unclear on this - it says that kernels cannot call stream operators. Perhaps this was supposed to say subkernels cannot call stream operators?

       

      If the gather stream in the subkernel is the source of my problem, how does one deal with gathers that should be in subkernels? Just inline the code in your kernel? Can you perform a gather on stream data that is local to the kernel (i.e., not an input)? Since you can't write to the input stream, how do you do a gather operation inside a loop, where each gather is performed on the result of the previous iteration?

        • brcc hangs - subkernel with stream operator?
          gaurav.garg

          As proposed_spin is uninitialized, shouldn't you modify the sub-kernel signature like this-

          kernel void computeEnergy(uint4 old_spin[][], out uint4 new_spin<>, out uint4 energy<>, int ROWS, int COLS)

            • brcc hangs - subkernel with stream operator?
              drstrip

              The code I posted is just to give the flavor.

              In the real code, some other ops initialize proposed_spin. But the question is really about why brcc hangs. The docs say kernels can't do stream ops, but is that really intended to mean subkernels can't do stream ops? If so, how do you do stream ops in a loop when you want to update the value of a stream var on each iteration, given that the input var is read-only?

               

                • brcc hangs - subkernel with stream operator?
                  gaurav.garg

                  I think the problem is in using the instance() method inside a sub-kernel.

                    • brcc hangs - subkernel with stream operator?
                      drstrip

                      With that latest hint, I'm making a little progress. The following example shows that indeed, you cannot have instance in a subkernel.

                      This compiles


                      kernel void subKernel(uint4 in_stream[], out uint4 out_stream<>)
                      {}

                      kernel void mainKernel(uint4 in_stream[], out uint4 out_stream<>)
                      {
                          subKernel(in_stream, out_stream);
                      }


                      Add an instance() call and it fails -


                      kernel void subKernel(uint4 in_stream[], out uint4 out_stream<>)
                      {
                          int4 indx = instance();
                      }

                      kernel void mainKernel(uint4 in_stream[], out uint4 out_stream<>)
                      {
                          subKernel(in_stream, out_stream);
                      }


                      OK, so we (at least I) now have learned that instance() is forbidden in a subkernel.  We can work around this limitation, as we can do the instance call in the main kernel and pass the value:


                      kernel void subKernel(uint4 in_stream[], int pos, out uint4 out_stream<>)
                      {
                        out_stream = in_stream[pos];
                      }


                      kernel void testKernel(uint4 in_stream[], out uint4 out_stream<>)
                      {
                        int4 indx = instance();
                        out_stream = in_stream[indx.x];
                        subKernel(in_stream, indx.x, out_stream);
                      }


                      Recall, however, that my goal is to perform an operation on the input stream data iteratively, updating the values each time through the loop. in_stream is read-only, so we can't operate on that. The obvious thought is to copy in_stream to some local var that we can write to. (I'll omit the subkernel code from here on out - it doesn't change)


                      kernel void testKernel(uint4 in_stream[], out uint4 out_stream<>)
                      {
                        int4 indx = instance();
                        uint4 local_stream = in_stream[indx.x];
                        subKernel(in_stream, indx.x, out_stream);
                      }


                      This compiles, so I can copy the in_stream to my local var. Now let's just replace in_stream in the subKernel call with local_stream -


                      kernel void testKernel(uint4 in_stream[], out uint4 out_stream<>)
                      {
                        int4 indx = instance();
                        uint4 local_stream = in_stream[indx.x];
                        subKernel(local_stream, indx.x, out_stream);

                      }


                      Now we get a compiler error complaining of an invalid cast. Apparently local_stream cannot be cast to the type of in_stream[] in the subkernel prototype. Its "type" is uint4, just like in the signature. It's enough of the same type as in_stream in the testKernel that I could assign to it with no cast problems. So, what's going on?

                        • brcc hangs - subkernel with stream operator?
                          gaurav.garg

                           

                          Now we get a compiler error complaining of an invalid cast. Apparently local_stream cannot be cast to the type of in_stream[] in the subkernel prototype. Its "type" is uint4, just like in the signature. It's enough of the same type as in_stream in the testKernel that I could assign to it with no cast problems. So, what's going on?
                          in_stream is a uint4 gather stream (and it is passed as uint4 var_name[] in the sub-kernel parameters), but local_stream is a uint4 variable (and it must be passed as uint4 var_name or uint4 var_name<> in the sub-kernel parameters).

                            • brcc hangs - subkernel with stream operator?
                              drstrip

                              how do I create a local var in the main kernel that can be passed as a uint4 gather stream to the subkernel?

                              The manual says that I can use an env var to specify read-write input streams, but suggests this is dangerous. Does that mean it doesn't work, may not work, is unreliable?

                                • brcc hangs - subkernel with stream operator?
                                  gaurav.garg

                                   

                                  how do I create a local var in the main kernel that can be passed as a uint4 gather stream to the subkernel?


                                  You cannot pass a local var as a gather stream to a sub-kernel. The main kernel's gather stream must be passed directly to the sub-kernel. There is no way to write a value to the gather stream.

                                   

                                  The manual says that I can use an env var to specify read-write input streams, but suggests this is dangerous. Does that mean it doesn't work, may not work, is unreliable?


                                  The manual is talking about a situation like this-

                                  kernel void test(float a<>, out float b<> )

                                  and then calling this kernel with the same stream as input and output.

                                  test(a, a);

                                    • brcc hangs - subkernel with stream operator?
                                      drstrip

                                      So this brings us full circle to my underlying question:

                                      Suppose you're trying to write a piece of code in which each element of an array updates its state based on the values of its neighbors. It's relatively straightforward to write this kernel using a gather stream. But now you want to loop over that update operation. Since the updated values cannot be written to the input gather stream, you can't just loop over the code you wrote in the once-through case. You can't create a local gather stream variable to pass to a subkernel. So how do you do it? Calling the kernel inside a loop running on the CPU means you have to pass data back and forth across the bus on every operation. If the computation has additional state along with the array itself, this can become extremely costly, killing any advantage of using the GPU.

                                       

                                      • brcc hangs - subkernel with stream operator?
                                        CaptainN

                                        drstrip,

                                        Actually you can. You do need to call kernels in a loop, but between kernel invocations, just re-assign the output stream to the input and the input to the output. This does not move any data around; it is just a handle swap. In the "second" kernel invocation you will receive the previous output stream as the input.

                                        The only problem is that within one pass, when computing element i+1, you cannot know whether element i has already been updated, which matters if element i+1 depends on element i. But this is the general approach in parallel computing.

                                        Once you finish your iterations, read the result out from whichever stream was last used as the output stream.
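The swap idea can be sketched on the CPU with two plain buffers. This is only an illustration under stated assumptions: `update_pass` and `iterate` are hypothetical names, `std::vector` stands in for a stream, and nothing here is the Brook+ API. Each pass reads only the old buffer and writes only the new one, so every element sees the previous iteration's neighbors, and the swap exchanges handles rather than copying elements.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one kernel pass: each element becomes the sum of
// itself and its two neighbors (periodic boundary). Reads only `in`, writes
// only `out`, so no element ever sees a partially updated neighbor.
void update_pass(const std::vector<int>& in, std::vector<int>& out) {
    const std::size_t n = in.size();
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[(i + n - 1) % n] + in[i] + in[(i + 1) % n];
    }
}

// Runs `steps` passes, swapping the roles of the two buffers after each one.
// std::swap on vectors exchanges their internal pointers; no elements move.
std::vector<int> iterate(std::vector<int> state, int steps) {
    std::vector<int> scratch(state.size());
    for (int s = 0; s < steps; ++s) {
        update_pass(state, scratch);
        std::swap(state, scratch);  // the latest result is now in `state`
    }
    return state;
}
```

Calling `iterate({1, 0, 0, 0}, 2)` runs two passes and returns the buffer that was written last, which mirrors reading back from whichever stream was the output of the final kernel call.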

                                         

                                         

                                          • brcc hangs - subkernel with stream operator?
                                            drstrip

                                            Captain N writes:
                                                You indeed need to call kernels in a loop, but between kernel invocations, just re-assign output stream to input, and input to output. It will not cause any data movement around, just handle swap. In a "second" kernel invocation you will receive output stream as an input.

                                            If I understand, you are suggesting something like this -

                                            int main(int argc, char** argv)
                                            {
                                              const int BUF_SIZE = 8192;
                                              int (*in_data)[BUF_SIZE]= new int [BUF_SIZE][BUF_SIZE];
                                              int (*out_data)[BUF_SIZE]= new int [BUF_SIZE][BUF_SIZE];

                                              CPerfCounter timer;

                                              unsigned int dims[2] = {BUF_SIZE, BUF_SIZE};

                                              brook::Stream< int>  in_stream(2, dims);
                                              brook::Stream< int>  out_stream(2, dims);

                                              timer.Reset();
                                              timer.Start();
                                              for (int i = 0; i < 10; ++i)
                                                in_stream.read(in_data);
                                              timer.Stop();

                                              std::cout << "Time to read stream = "<< timer.GetElapsedTime()/10  << std::endl;

                                              timer.Reset();
                                              timer.Start(); 

                                              for (int i = 0; i < 10; ++i)
                                              {
                                                testKernel(in_stream, out_stream);
                                                testKernel(out_stream, in_stream);
                                              }

                                              timer.Stop();
                                              std::cout << "Time to execute kernel = " << timer.GetElapsedTime()/10. << std::endl;
                                            }

                                            Let's use a trivial kernel

                                            kernel void testKernel(int in_stream<>, out int out_stream<>)
                                            {
                                              return;
                                            }

                                            If no data is moved during the kernel call, then the second loop should take roughly the same amount of time regardless of the buffer size. However, that's not the case.

                                            BUF_SIZE     Time to read stream   Time to execute kernel
                                            1024            .0050                     .0023
                                            2048            .016                      .0045
                                            4096            .065                      .016
                                            8192            .43                       1.246

                                            (Times are in seconds).

                                            This strongly suggests to me that each call to the kernel involves a data transfer, making it very costly for large arrays passed to the kernel.

                                              • brcc hangs - subkernel with stream operator?
                                                gaurav.garg

                                                First of all, your performance measurement is wrong. Both streamRead and kernel calls are asynchronous. Also, a kernel call waits for streamRead to finish before the kernel executes.

                                                So, your time measurement should be something like this-

                                                stream.finish();

                                                //timer_start();

                                                // operation on stream - stream.read() or stream.write()

                                                stream.finish();

                                                // timer_stop();
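Why the second finish() matters can be shown with a toy model. This is a sketch under loud assumptions: `ToyQueue` and `completed_at_timer_stop` are hypothetical, the queue runs work only when `finish()` is called, and it says nothing about how the Brook+ runtime actually schedules work. It only shows that stopping a timer right after an asynchronous call can capture none of the work.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy model of an asynchronous runtime (hypothetical, not the Brook+ API):
// enqueue() returns immediately, finish() blocks until queued work has run.
class ToyQueue {
public:
    void enqueue(std::function<void()> op) { pending_.push_back(std::move(op)); }
    void finish() {
        for (auto& op : pending_) op();
        pending_.clear();
    }
private:
    std::vector<std::function<void()>> pending_;
};

// Returns how many of `n` queued operations had completed at the point where
// a timer would be stopped.
int completed_at_timer_stop(int n, bool sync_before_stop) {
    ToyQueue q;
    int completed = 0;
    q.finish();  // drain earlier work before "timer start"
    for (int i = 0; i < n; ++i) {
        q.enqueue([&completed] { ++completed; });
    }
    if (sync_before_stop) q.finish();  // the second finish() in the pattern
    const int seen = completed;        // "timer stop" happens here
    q.finish();                        // let any remaining work complete
    return seen;
}
```

In this model, `completed_at_timer_stop(10, false)` reports that none of the ten operations had finished when the timer stopped, while passing `true` accounts for all ten.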

                                                If no data is moved during the kernel call, then the second loop should take roughly the same amount of time regardless of the buffer size. However, that's not the case.


                                                Consider the 2048 buffer size. If data were transferred between the two kernel calls, the kernel call time would have to include 4 * .016 sec = .064 sec (a streamWrite and a streamRead for each of in_stream and out_stream). The measured .0045 sec is far below that, so that is definitely not the case.
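The same comparison can be applied to every row of the posted table. A minimal sketch, assuming (as in the argument above) that a hypothetical per-call transfer would cost about four stream reads; the names `Timing`, `implied_transfer_cost` and `below_transfer_estimate` are made up for illustration, and the numbers are copied from the table in the thread.

```cpp
#include <cassert>

// One row of the timing table posted earlier in the thread (seconds).
struct Timing {
    int buf_size;
    double read_s;    // "Time to read stream"
    double kernel_s;  // "Time to execute kernel"
};

const Timing kTimings[] = {
    {1024, 0.0050, 0.0023},
    {2048, 0.016, 0.0045},
    {4096, 0.065, 0.016},
    {8192, 0.43, 1.246},
};

// Estimated cost if each measurement really included moving both streams
// across the bus: four transfers, each assumed to cost about one stream read.
double implied_transfer_cost(const Timing& t) { return 4.0 * t.read_s; }

// True when the measured kernel time is smaller than that estimate.
bool below_transfer_estimate(const Timing& t) {
    return t.kernel_s < implied_transfer_cost(t);
}
```

All four rows come out below the estimate, although the 8192 row does so by a much smaller margin than the others.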

                                                  • brcc hangs - subkernel with stream operator?
                                                    drstrip

                                                    Some new experiments:

                                                    trivial kernel as above

                                                    relevant parts of caller look like

                                                    in_stream.finish();

                                                    timer.start();

                                                    in_stream.read();

                                                    for (i = 0; i < n; ++i)  // test with n = 10, 100

                                                    {

                                                       testKernel(in_stream, out_stream);

                                                       testKernel(out_stream, in_stream);

                                                    }

                                                    timer.stop();

                                                     

                                                    For n = 10, 100, the difference in execution time will represent the extra iterations of the loop, since each has the same stream.read().

                                                     

                                                    Compute the kernel call time per loop iteration (hence two kernel calls) as (t_100 - t_10)/90. You get the following time per iteration:

                                                    1024 - .001576 secs

                                                    2048 - .003874 secs

                                                    4096 - .015275 secs

                                                    8192 - 1.3053 secs

                                                     

                                                    Once again, the times suggest that data is being transferred as part of the call, unless there is some other language feature I don't understand.
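The differencing step and the growth between sizes can be written out as a short sketch. The helper names are hypothetical; the per-iteration values are the ones listed above. Doubling BUF_SIZE quadruples the number of elements, so a cost proportional to element count would grow by about 4x per step.

```cpp
#include <cassert>

// Per-iteration cost from two runs that share the same fixed setup cost
// (here, a single stream read) but differ in loop count.
double per_iteration_time(double t_small, int n_small,
                          double t_large, int n_large) {
    return (t_large - t_small) / static_cast<double>(n_large - n_small);
}

// Growth factor between the per-iteration times of two buffer sizes.
double growth(double smaller_case, double larger_case) {
    return larger_case / smaller_case;
}

// Per-iteration times reported above (seconds) for BUF_SIZE 1024 to 8192.
const double kPerIter[] = {0.001576, 0.003874, 0.015275, 1.3053};
```

With these numbers the growth factors are roughly 2.5x, 3.9x and 85x. The first two steps stay at or below the 4x growth in element count, and the jump from 4096 to 8192 is the one that stands out.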

                                                    These times also allow you to compute the elapsed time for the stream.read() operation. The computed values are consistent with the values I get from direct timing using the following snippet:

                                                     

                                                      in_stream.finish();

                                                      timer.start();

                                                      for (int i = 0; i < 10 ; ++i)

                                                         in_stream.read();

                                                      in_stream.finish();

                                                      timer.stop();

                                                     

                                                      • brcc hangs - subkernel with stream operator?
                                                        gaurav.garg

                                                         

                                                        If no data is moved during the kernel call, then the second loop should take roughly the same amount of time regardless of the buffer size. However, that's not the case.


                                                         

                                                        Once again, the times suggest that data is being transferred as part of the call, unless there is some other language feature I don't understand.


                                                        I don't understand how you reached this conclusion.

                                                          • brcc hangs - subkernel with stream operator?
                                                            drstrip

                                                            In my experiment I make 1 stream read call, then 10 pairs of kernel calls and collect the total time. I repeat the experiment, this time making 1 stream call and 100 pairs of kernel calls. The difference in time is equal to the time of making 90 pairs of kernel calls (except for some very small loop overhead). I performed this experiment for different stream sizes. That is the timing I reported in my previous post. Regardless of the cause, it is clear that calling an empty kernel with a larger stream takes more time than calling the same empty kernel with a smaller stream. What would cause this? I conjectured that some data transfer must be taking place. Perhaps this is wrong, but then what is causing the kernel with larger streams to take longer?

                                                            I have been avoiding looking at the compiler output, but maybe that's what it will come down to if we hope to understand what's going on.