My kernel does a cell-by-cell computation on a matrix. The matrix is square and is input as cl_int4, so the number of rows passed to the kernel is 4 times the number of columns. In the profiler output below, the input array is 4K x 4K, passed as 4K rows of 1K columns of cl_int4. Hence, the number of work-items is 4K x 1K. The work-group size is set to {1; 1; 1} since there is no sharing among the work-items (and I'm using a FireStream, which I believe has no shared memory). The profiler returns this:
GlobalWorkSize | GroupWorkSize | KernelTime | LocalMem | MemTransferSize | ALU | Fetch | Write | Wavefront | ALUBusy | ALUFetchRatio | ALUPacking | FetchUnitBusy | FetchUnitStalled | WriteUnitStalled |
{4096; 1024; 1} | {1; 1; 1} | 2902.221 | 0 | | 200 | 27 | 34 | 4194304 | 15.42 | 7.41 | 84.7 | 7.63 | 0 | 0 |
If I divide the total number of work-items (4K x 1K = 4,194,304) by the total number of wavefronts (4,194,304), I get only 1 item/wavefront. Can that be right? I thought I should be seeing something like 64 items/wavefront.
Why does the profiler report no local memory (LocalMem = 0) when my kernel declares a number of variables inside the kernel body, which the spec says default to the private address space (and thus, as I understand it, local memory)?
Thanks.