9 Replies Latest reply on Sep 10, 2009 10:00 AM by Raistmer

    How non-blocking kernel call is?

    Raistmer
      Kernel call time depends on domain size

      I measured the time of the kernel call itself.
      That is, the number of ticks from the kernel call statement until control returns to the calling program, not the time to kernel completion (i.e., not the time at which isSync() returns true).

      As stated in the Brook manual, kernel calls are asynchronous; that is, they should return control "immediately".
      But this "immediately" depends strongly on the size of the domain the kernel was called on.
      Mean kernel call times vary from 7.68e+006 (min=1.72e+005, max=1.09e+008) to 1.19e+008 (min=2.25e+005, max=9.77e+008).
      Times are in clock ticks, measured on a Q9450 @ 2.66 GHz.
      The domain size varied by a factor of roughly 8 or more.
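
For scale, the tick counts above can be converted to wall-clock time by dividing by the clock rate. The following is only a minimal sketch: the helper name ticks_to_ms is made up for illustration, and it assumes the tick counter advances at the nominal 2.66 GHz at all times, which may not hold exactly on every system.

```cpp
#include <cassert>

// Convert a tick count (e.g. an rdtsc difference) to milliseconds.
// Assumption: the counter advances at a constant cpu_hz ticks per second.
double ticks_to_ms(double ticks, double cpu_hz) {
    assert(cpu_hz > 0.0);
    return ticks / cpu_hz * 1000.0;
}
```

Under that assumption the two reported means, 7.68e+006 and 1.19e+008 ticks, correspond to roughly 2.9 ms and 44.7 ms, and the largest maximum (9.77e+008 ticks) to roughly 367 ms.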

      Again, this is not the kernel completion time; it is just the time from starting the kernel to the moment control is returned to the program.

      Why is there such a strong dependence on domain size for an asynchronous call?
        • How non-blocking kernel call is?
          gaurav.garg

          What do you mean by domain? The domain of execution, the domain operator on a stream, or just a different output stream size?

            • How non-blocking kernel call is?
              Raistmer
              In this test I didn't use a specific domain of execution, so domain of execution == size of output stream == the parameter that changed between kernel calls.
              No stream domain operator is used here (so no inner memory copies).

              And I waited until the input stream read operation had finished.
                • How non-blocking kernel call is?
                  gaurav.garg

                  Did you use scatter stream?

                    • How non-blocking kernel call is?
                      Raistmer
                      No. The kernel is attached below, and the call statement too.
                      With a scatter stream I received much worse timings, so the kernel was rearranged to use only gather and ordinary streams.

                      Any thoughts?

                      Kernel:

                      kernel void GPU_fetch_array_kernel94t(float src[],int offsets[][],float freq[],out float4 dest<>){
                          //R: here we should form dest buffer element one by one, no arrays inside kernel allowed
                          //So we should return back to initial fetch order that was so unoptimal for CPU version
                          //will hope here access to memory with stride will not degrade performance so much as it was for CPU version
                          int j=instance().y;
                          int threadID=instance().x;
                          int k=0;
                          int l=0;
                          float4 acc=float4(0.f,0.f,0.f,0.f);
                          float f=freq[threadID];
                          //double period=periods[threadID];//(double)sub_buffer_size/(double)f;
                          int n_per=(int)f;
                          for(k=0;k<n_per;k++){
                              l=offsets[threadID][k];
                              l+=(4*j);//R: index to data array computed
                              acc.x+=src[l];
                              acc.y+=src[l+1];
                              acc.z+=src[l+2];
                              acc.w+=src[l+3];
                          }
                          dest=acc;
                      }

                      And this is the kernel call:

                      {Timings<T_Stream> cc;
                          //fprintf(stderr,"gpu_offsets.before read\n");
                          {Timings<T_Stream1> c1;
                              gpu_offsets->read(offsets);
                              do{
                                  Sleep(7);fprintf(stderr,"offsets_sleep\n");
                              }while(!gpu_offsets->isSync());
                          }
                          //fprintf(stderr,"gpu_offsets.finish\n");
                          //GPU_fetch_array_kernel5.domainOffset(uint4(0,t_offset,0,0));
                          //GPU_fetch_array_kernel5.domainSize(uint4(1,sb_size,1,1));
                          {Timings<T_Stream2> c2;
                              GPU_fetch_array_kernel94t(gpu_data,*gpu_offsets,/**gpu_per_int,*/gpu_freqs,*gpu_temp);
                          }
                      }
                      do{
                          Sleep(sleep_fetch);
                      }while(!gpu_temp->isSync());

                      The Timings<> template's constructor/destructor pair just calls rdtsc and records the tick count difference. Its overhead is ~1e2 ticks, while the kernel call took 1e6 to 1e8 ticks (see the first post).
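
The source of Timings<> is not shown in the thread, so the following is only a sketch of how a constructor/destructor timer of that kind might look. It is not the poster's implementation: the names ScopedTimer, TimingStats, ReadTag and CallTag are hypothetical, and std::chrono::steady_clock is swapped in for rdtsc to keep the example portable, so it records nanoseconds rather than clock ticks.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Accumulated statistics for one measurement tag (hypothetical helper).
struct TimingStats {
    std::uint64_t count = 0;
    std::int64_t total_ns = 0;
    std::int64_t min_ns = std::numeric_limits<std::int64_t>::max();
    std::int64_t max_ns = 0;

    void add(std::int64_t ns) {
        ++count;
        total_ns += ns;
        if (ns < min_ns) min_ns = ns;
        if (ns > max_ns) max_ns = ns;
    }

    double mean_ns() const {
        assert(count > 0);
        return static_cast<double>(total_ns) / static_cast<double>(count);
    }
};

// Hypothetical tag types, playing the role of T_Stream1 / T_Stream2.
struct ReadTag {};
struct CallTag {};

// The constructor records the start time; the destructor adds the elapsed
// time to the statistics shared by all timers with the same Tag.
// Note: steady_clock replaces rdtsc here, so the unit is nanoseconds.
template <typename Tag>
class ScopedTimer {
public:
    ScopedTimer() : start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        stats().add(
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

    static TimingStats& stats() {
        static TimingStats s;  // one instance per Tag type
        return s;
    }

private:
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping only the call statement in a scope such as `{ ScopedTimer<CallTag> t; /* kernel call */ }` would then measure the time until control returns, separately from the later wait on isSync().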

                        • How non-blocking kernel call is?
                          Raistmer
                          And another observation:
                          sometimes I see a delay of only 1 ms (or less) until another kernel completes (when isSync returns true), but sometimes it is more than 43 ms.
                          Benchmarked without any other load on the system. These long delays are neither first-time calls nor only the biggest-domain calls (though the biggest domains tend to show bigger timings, of course, such sharp spikes are not always associated with them).
                            • How non-blocking kernel call is?
                              gaurav.garg

                              If you think about it from a graphics perspective, a bigger domain means a bigger rectangle that we want to render. All the kernel processing is done in the fragment stage (after rasterization) of the graphics pipeline. The time required for rasterization will be higher for large domains.

                                • How non-blocking kernel call is?
                                  Raistmer
                                  Originally posted by: gaurav.garg

                                  If you think about it from a graphics perspective, a bigger domain means a bigger rectangle that we want to render. All the kernel processing is done in the fragment stage (after rasterization) of the graphics pipeline. The time required for rasterization will be higher for large domains.



                                  So is that "rasterization" done on the CPU?
                                  Why does it block the CPU?
                                  I am not examining kernel execution time itself here, only the time the CPU needs to start the kernel and return control to the program.
                                  That is, this rasterization would have to be done synchronously, right?

                                  I am just trying to find the optimal conditions for freeing as much CPU as possible, because the CPU will be busy with another task while my kernels run on the GPU...
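
For the part of the wait that is under the application's control (the loop after the call has returned), one option is to run slices of useful CPU work between completion checks instead of sleeping. The sketch below is only an illustration under stated assumptions: poll_with_work is a hypothetical helper, is_done stands in for a query such as isSync(), and it does nothing about time spent inside the kernel call itself.

```cpp
#include <cassert>
#include <functional>

// Runs work_slice() repeatedly until is_done() reports completion.
// Returns the number of work slices executed while waiting.
// Assumption: is_done is cheap to call and wraps a completion query
// such as isSync(); work_slice is a short, bounded piece of CPU work.
int poll_with_work(const std::function<bool()>& is_done,
                   const std::function<void()>& work_slice) {
    assert(is_done && work_slice);
    int slices = 0;
    while (!is_done()) {
        work_slice();
        ++slices;
    }
    return slices;
}
```

The shorter each work slice, the sooner completion is noticed; longer slices cost fewer polls but add latency before the result is picked up.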
                                    • How non-blocking kernel call is?
                                      Gipsel

                                       

                                      Originally posted by: Raistmer
                                      Originally posted by: gaurav.garg If you think about it from a graphics perspective, a bigger domain means a bigger rectangle that we want to render. All the kernel processing is done in the fragment stage (after rasterization) of the graphics pipeline. The time required for rasterization will be higher for large domains.

                                       

                                      So is that "rasterization" done on the CPU? Why does it block the CPU? I am not examining kernel execution time itself here, only the time the CPU needs to start the kernel and return control to the program. That is, this rasterization would have to be done synchronously, right? I am just trying to find the optimal conditions for freeing as much CPU as possible, because the CPU will be busy with another task while my kernels run on the GPU...


                                      No, the rasterization is of course done on the GPU. It basically maps the domain of execution to the shader units (using a "hierarchical Z" shaped line, as mentioned somewhere in the documentation). At least that is the case for pixel shader kernels; compute shaders should largely bypass this stage.

                                      But in my experience the rasterization isn't a real bottleneck for the throughput of somewhat longer kernels. It adds some latency (i.e. it should matter more for small execution domains), but that added latency shouldn't scale with the domain size, as the rasterizer can work in parallel with the shader units. The prerequisites are that you have significantly more threads than execution units and that the execution time in the shader units is higher than the time needed for rasterization (so it can be hidden).

                                      Gaurav may correct me if I'm wrong, but at least this is how I understand it.

                                        • How non-blocking kernel call is?
                                          Raistmer
                                          Well, the question is still open then...
                                          I see the "kernel preparation" time (as distinct from kernel run time) increase with the execution domain even when only a single instance of my app is running.
                                          That is, the low GPU memory conditions (discussed in another thread) do not apply here (total memory requirements are far below 512 MB, and almost all memory is allocated "statically" at the beginning of the program, so Brook+ can easily cache the buffers it needs and there is no need to change their sizes during app execution...)