3 Replies Latest reply on Sep 4, 2009 5:15 PM by Raistmer

    Need explanation about the nature of Brook+

    riza.guntur

      A few questions have come to mind. One was answered right after the "First Brook call is sloooooooowww" thread; the ones that remain are:

      1. What actually happens during a kernel invocation? I've read a bit about kernel compilation, but how long is the compiled kernel kept in memory, i.e. when does it get deleted? I have two loops that call the same kernel functions. When the first loop finishes and execution moves on to the second loop (in a different scope, of course), the program seems to stall for about a second. Between the two loops, the kernels are given only one or two different input streams, not all of them. For example:

      for (int epoch = 0; epoch < num_of_epoch; epoch++)
      {
          for (int row = 0; row < (int) yB; row++)
          {
              myufy_wrong(row, alpha, *fuzzy_number, *vec_ref, myu);
              myu_min_all(myu, myu_min);

              // for small dimensions
              myu_max_min(myu_min, winner);
              calc_vec_ref_next(row, alpha, kappa, winner, *fuzzy_number, *vec_ref, vec_ref_next);
              copy4(vec_ref_next, *vec_ref);
          }
          alpha = 0.9999f * alpha;
          kappa = 0.01f * alpha;
      }
      Timer.Stop();

      printf("GPU Time: %lf\n", Timer.GetElapsedTime());

      int num_of_row = (int) yD / jumlahDiSatuGrup;

      for (int row = 0; row < num_of_row; row++)
      {
          myufy_wrong(row, alpha, *fuzzy_number_test, *vec_ref, myu);
          myu_min_all(myu, myu_min);

          // for small dimensions
          myu_max_min(myu_min, winner);
          winner.write(winner_array);

          if (fabsf(winner_array[0].y - winner_array[0].z) < 0.0001f)
              counter++;

          printf("%f %f %f %f\n", winner_array[0].x, winner_array[0].y, winner_array[0].z, winner_array[0].w);
      }

      2. The nature of kernel calls. How is a kernel call handled: synchronously or asynchronously? The User Guide says that only one kernel runs at a time on a GPU, but if the same kernel is called several times in quick succession, do those calls run in parallel on the GPU? I couldn't find this explained anywhere, which makes it hard to explain to anyone else.

      3. I've seen performance drop when memory is nearly full, e.g. 490 MB used out of 512 MB. What causes this?

      Thanks in advance :)

        • Need explanation about the nature of Brook+
          gaurav.garg

          1. The kernel is deleted only when the application exits or when brook::destroyAllDevices() is called. Stream::read() calls are asynchronous, and a kernel has to wait for Stream::read() to finish if there is any dependency. You might be seeing that slowdown because your kernel invocation is waiting on a Stream::read(). You should call Stream::finish() after Stream::read() to measure correct timings.

          2. A kernel call is asynchronous: it returns after sending the kernel execution command to the GPU. The GPU finishes the first kernel call before executing the second, even if the same kernel is called twice. Also, Brook+ waits for the first kernel to finish if the second kernel depends on any output stream of the first.
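To make the two points above concrete, here is a rough sketch in plain C++ (no Brook+ involved) of a queue that behaves the way this answer describes: a call returns as soon as the command is queued, commands run one at a time in the order they were issued, and a timer only sees the real cost after an explicit wait. `InOrderQueue`, `Timing` and `run_demo` are made-up names for illustration, and the 50 ms sleep stands in for kernel work; the `finish()` method here only plays the role that Stream::finish() plays in the answer and is not the Brook+ implementation.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a GPU command queue. NOT part of Brook+.
class InOrderQueue {
public:
    InOrderQueue() : worker_([this] { run(); }) {}

    ~InOrderQueue() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();  // the worker drains any remaining commands first
    }

    // Returns right after queuing, like an asynchronous kernel call.
    void enqueue(std::function<void()> cmd) {
        {
            std::lock_guard<std::mutex> lock(m_);
            pending_.push(std::move(cmd));
        }
        cv_.notify_all();
    }

    // Blocks until every command queued so far has completed.
    void finish() {
        std::unique_lock<std::mutex> lock(m_);
        idle_cv_.wait(lock, [this] { return pending_.empty() && !busy_; });
    }

private:
    // Single worker thread: commands execute one at a time, in FIFO order.
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return stop_ || !pending_.empty(); });
            if (pending_.empty()) return;  // stop requested, nothing left
            std::function<void()> cmd = std::move(pending_.front());
            pending_.pop();
            busy_ = true;
            lock.unlock();
            cmd();  // run the "kernel" without holding the lock
            lock.lock();
            busy_ = false;
            idle_cv_.notify_all();
        }
    }

    std::mutex m_;
    std::condition_variable cv_, idle_cv_;
    std::queue<std::function<void()>> pending_;
    bool stop_ = false;
    bool busy_ = false;
    std::thread worker_;  // declared last so the other members exist first
};

struct Timing {
    double after_enqueue_ms = 0.0;  // timer read right after the calls return
    double after_finish_ms = 0.0;   // timer read after waiting for completion
    std::vector<int> order;         // order in which the calls actually ran
};

// Issues three simulated kernel calls and reads the timer twice.
Timing run_demo() {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    Timing t;
    InOrderQueue gpu;
    const auto start = clock::now();
    for (int call = 0; call < 3; ++call) {
        gpu.enqueue([&t, call] {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            t.order.push_back(call);
        });
    }
    t.after_enqueue_ms = ms(clock::now() - start).count();
    gpu.finish();
    t.after_finish_ms = ms(clock::now() - start).count();
    return t;
}
```

Calling `run_demo()` should give an `after_enqueue_ms` that is far smaller than `after_finish_ms`, since the second reading includes the three sequential 50 ms "kernels", and `order` should come back as 0, 1, 2. The same reasoning suggests that a timer stopped right after a loop of kernel calls, without a finish, may mostly be measuring how long it took to issue the calls.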

            • Need explanation about the nature of Brook+
              riza.guntur

              Thanks gaurav, that answers a lot of things.

              Regarding number 2: if a kernel is called twice, are the calls queued in the hardware? Or does the CPU wait until the device signals that it is ready? Or does the CPU just issue the calls rapidly, and the device executes each command when it is ready and then tells the CPU that the call has been executed?

              As for the performance with nearly full memory, I think the nearly full memory itself is the culprit. I've read that OMM gets around 540 GFLOPS at 4096x4096 on a 4870 (I don't know how much memory it had).

              • Need explanation about the nature of Brook+
                Raistmer
                2. Kernel call is asynchronous and it returns after sending kernel execution command to GPU.

                Would it depend on the domain size, then? I see very large variation in kernel call times when the domain size changes...