To speed up the calculation, I tried changing the
"*local_work_size" parameter of the clEnqueueNDRangeKernel API.
I expected that the larger "local_work_size" is, the faster the processing would be.
But in practice the processing speed saturates at
local_work_size = 8 to 16,
and gradually slows down once local_work_size goes above 20.
I find this strange, because the A10-7850A has 512 stream processors.
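
For reference, here is a minimal sketch of how the local_work_size values for a test like this might be chosen. It assumes OpenCL 1.x rules, where global_work_size must be evenly divisible by local_work_size and local_work_size must not exceed the device's maximum work-group size. The function name `valid_local_sizes` and its parameters are hypothetical, not part of the OpenCL API, and `max_local` is assumed to come from a query such as CL_DEVICE_MAX_WORK_GROUP_SIZE.

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL function).
 * Collects every local_work_size candidate that evenly divides `global`
 * and does not exceed `max_local`. At most `out_cap` values are written
 * to `out`; the return value is the number of values written. */
size_t valid_local_sizes(size_t global, size_t max_local,
                         size_t *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t l = 1; l <= max_local && l <= global; ++l) {
        /* OpenCL 1.x assumption: global size must be a multiple of local size. */
        if (global % l == 0 && n < out_cap) {
            out[n++] = l;
        }
    }
    return n;
}
```

Each returned value could then be passed as local_work_size to clEnqueueNDRangeKernel and timed separately, so that sizes which are rejected by the runtime are not mixed into the measurement.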