1 Reply Latest reply on Apr 9, 2010 8:05 PM by nou

    Need help understanding profiler output

    drstrip

      My kernel does a cell-by-cell computation on a square matrix. The matrix is passed in as cl_int4, so the kernel sees 4 times as many rows as columns. In the profiler output below, the input array is 4K x 4K, passed as 4K rows of 1K columns of cl_int4, so the number of work-items is 4K x 1K. The work-group size is set to {1, 1, 1} since there is no sharing among the work-items (and I'm using a FireStream, which I believe has no shared memory). The profiler returns this:

       

       GlobalWorkSize GroupWorkSize KernelTime LocalMem MemTransferSize ALU Fetch Write Wavefront ALUBusy ALUFetchRatio ALUPacking FetchUnitBusy FetchUnitStalled WriteUnitStalled
       {4096; 1024; 1} {1; 1; 1}2902.22102002734419430415.427.4184.77.6300
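To make the sizes above concrete, here is a minimal sketch in plain C of how the GlobalWorkSize of {4096; 1024; 1} follows from packing a 4K x 4K integer matrix into cl_int4 elements. The helper names `packed_cols` and `total_work_items` are made up for illustration and are not part of my kernel or of OpenCL; the sketch assumes one work-item per cl_int4 element and a column count divisible by 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of int4 columns needed for a row of
 * `cols` plain ints. Assumes cols is a multiple of 4. */
static size_t packed_cols(size_t cols)
{
    assert(cols % 4 == 0);
    return cols / 4;
}

/* Hypothetical helper: total work-items when each work-item
 * processes exactly one int4 element of the packed matrix. */
static size_t total_work_items(size_t rows, size_t cols)
{
    return rows * packed_cols(cols);
}
```

For the 4096 x 4096 input, `packed_cols(4096)` gives 1024 and `total_work_items(4096, 4096)` gives 4194304, which matches the 4K x 1K global size in the profile.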

       

      If I divide the total number of work-items (4K x 1K) by the total number of wavefronts, I get only about 10 items/wavefront. Can that be right? I thought I should be seeing something like 64 items/wavefront.
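The arithmetic behind that question can be sketched in plain C as below. The function names are hypothetical, the wavefront size of 64 is my expectation rather than something the profiler reports, and the wavefront count used in the usage note is only a rough figure back-derived from the "about 10" ratio, not an exact counter value.

```c
#include <assert.h>
#include <stddef.h>

/* Wavefronts needed if every wavefront were completely filled with
 * `wavefront_size` work-items (rounded up for a partial last one). */
static size_t full_wavefronts(size_t work_items, size_t wavefront_size)
{
    assert(wavefront_size > 0);
    return (work_items + wavefront_size - 1) / wavefront_size;
}

/* Average number of work-items per wavefront for an observed
 * wavefront count. */
static double items_per_wavefront(size_t work_items, size_t wavefronts)
{
    assert(wavefronts > 0);
    return (double)work_items / (double)wavefronts;
}
```

With 4096 * 1024 = 4194304 work-items, `full_wavefronts(4194304, 64)` is 65536, which is what I expected to see. A ratio of about 10 instead implies roughly 419430 wavefronts, more than six times as many.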

       

      Why am I seeing no local memory when my kernel declares a number of variables locally, which the spec says default to private (and thus, I assumed, local memory)?

       

       

      Thanks.