To speed up the calculation, I tried changing the
"*local_work_size" parameter of the clEnqueueNDRangeKernel API.
I expected that a larger "local_work_size" would give faster processing.
But in practice the processing speed saturates at
local_work_size = 8 to 16,
and gradually slows down above local_work_size = 20.
I find this strange because the A10-7850A has 512 stream processors.
There's no quick and easy answer to this question in general. Guessing is not the right thing to do.
Take a look at CodeXL if you haven't already.
Profile your application, then in the profile results click on the "Kernel Occupancy" value of each call. You will be presented with various graphs that help you understand what's going on as the work size changes.
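Before profiling, a rough back-of-the-envelope check can help narrow down which local_work_size values are worth measuring. The sketch below is only an illustration and does not replace the profiler. It assumes a wavefront width of 64 work-items, which is typical of AMD GCN GPUs but should be confirmed on the actual device (for example with clGetKernelWorkGroupInfo and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE). The names round_up_global, lane_utilization_percent and ASSUMED_WAVEFRONT_SIZE are hypothetical helpers, not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64 work-items per wavefront (typical of AMD GCN GPUs).
 * Query the real value on your device instead of trusting this constant. */
#define ASSUMED_WAVEFRONT_SIZE 64

/* Hypothetical helper: smallest multiple of local_size that is >= n.
 * OpenCL 1.x requires global_work_size to be evenly divisible by
 * local_work_size, so the global size is usually padded like this. */
size_t round_up_global(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return (n + local_size - 1) / local_size * local_size;
}

/* Hypothetical helper: rough percentage of SIMD lanes that hold a
 * work-item, assuming each work-group is split into whole wavefronts
 * and a wavefront never spans two work-groups. */
unsigned lane_utilization_percent(size_t local_size)
{
    assert(local_size > 0);
    size_t wavefronts = (local_size + ASSUMED_WAVEFRONT_SIZE - 1)
                        / ASSUMED_WAVEFRONT_SIZE;
    return (unsigned)(100 * local_size
                      / (wavefronts * ASSUMED_WAVEFRONT_SIZE));
}
```

Under these assumptions, lane_utilization_percent(16) gives 25 and lane_utilization_percent(64) gives 100, so local sizes of 8 to 20 would leave most lanes idle. If the measured time still does not improve at 64, that points to a bottleneck other than the ALUs.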
I agree with maxdz8.
Thank you for the reply.
I found that the saturation is caused by memory access:
random memory access takes 300 ms per 100,000,000 accesses,
while clEnqueueNDRangeKernel takes 420 ms per 100,000,000 accesses.
So in this case the data processing time depends not on local_work_size
but on memory access.
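The arithmetic behind these figures can be sketched as follows. This assumes the two timings are directly comparable (same number of accesses, same access pattern); the function names ns_per_access and memory_share are hypothetical and only illustrate the calculation.

```c
#include <assert.h>

/* Hypothetical helper: average cost of one access in nanoseconds,
 * given a total time in milliseconds (1 ms = 1,000,000 ns). */
double ns_per_access(double total_ms, double accesses)
{
    assert(accesses > 0.0);
    return total_ms * 1.0e6 / accesses;
}

/* Hypothetical helper: fraction of the kernel time that the memory
 * accesses alone would account for (0.0 to 1.0 when memory_ms <= kernel_ms). */
double memory_share(double memory_ms, double kernel_ms)
{
    assert(kernel_ms > 0.0);
    return memory_ms / kernel_ms;
}
```

With the numbers above, ns_per_access(300.0, 100000000.0) is about 3 ns, ns_per_access(420.0, 100000000.0) is about 4.2 ns, and memory_share(300.0, 420.0) is roughly 0.71. In other words, about 71% of the kernel time would be spent on random memory access, which leaves little room for local_work_size to make a difference.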