
drstrip
Journeyman III

Need help understanding profiler output

My kernel does a cell-by-cell computation on a matrix. The matrix is square and is input as cl_int4, so the number of rows passed to the kernel is 4 times the number of columns. In the profiler output below, the input array is 4K x 4K, passed as 4K rows of 1K columns of cl_int4. Hence, the number of work-items is 4K x 1K. The work-group size is set to 1,1,1 since there is no sharing among the work-items (and I'm using a FireStream, which has no shared memory, I believe). The profiler returns this:

 

 GlobalWorkSize GroupWorkSize KernelTime LocalMem MemTransferSize ALU Fetch Write Wavefront ALUBusy ALUFetchRatio ALUPacking FetchUnitBusy FetchUnitStalled WriteUnitStalled
 {4096; 1024; 1} {1; 1; 1}2902.22102002734419430415.427.4184.77.6300

 

If I divide the total number of work-items (4K x 1K) by the total number of wavefronts, I get only about 10 items/wavefront. Can that be right? I thought I should be seeing something like 64 items/wavefront.

 

Why am I seeing no local memory, when my kernel declares a number of variables locally, which the spec says default to private (and thus local memory)?

 

 

Thanks.

 

 

 

 

1 Reply
nou
Exemplar

Even if you do not share data between work-items, you should set local_size to NULL or 64 to increase performance, because right now it executes only one work-item per SIMD core, which means only 10-20 work-items run at a time.
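To make the effect of the work-group size concrete, here is a rough sketch of the lane-occupancy arithmetic. It assumes a wavefront width of 64 (the figure quoted in the question) and relies on the fact that a wavefront never spans two work-groups. The helper names are made up for illustration and are not part of any OpenCL or profiler API, and the model does not try to reproduce the profiler's Wavefront counter.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed wavefront width; 64 is the figure quoted in the question. */
#define WAVEFRONT_SIZE 64

/* Wavefronts needed for one work-group. A wavefront never spans two
   work-groups, so a partially filled wavefront still takes a full slot. */
static size_t wavefronts_per_group(size_t group_items)
{
    assert(group_items > 0);
    return (group_items + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

/* Total wavefronts for a launch of global_items work-items split into
   work-groups of group_items each (global_items must divide evenly). */
static size_t total_wavefronts(size_t global_items, size_t group_items)
{
    assert(group_items > 0 && global_items % group_items == 0);
    return (global_items / group_items) * wavefronts_per_group(group_items);
}

/* Percentage of wavefront lanes that carry an actual work-item. */
static double lane_utilization_percent(size_t group_items)
{
    return 100.0 * (double)group_items /
           (double)(wavefronts_per_group(group_items) * WAVEFRONT_SIZE);
}
```

Under these assumptions, a 4096 x 1024 launch with a work-group size of 1 fills only 1 lane in 64 of every wavefront, while a work-group size of 64 fills them all. In host code this corresponds to the local_work_size argument of clEnqueueNDRangeKernel: pass NULL to let the runtime choose, or an array such as {64, 1, 1}, which divides the global size of 4096 evenly in the first dimension.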
