Archives Discussions

obara · ‎05-01-2014

To speed up my calculation, I tried changing

the "local_work_size" parameter of the clEnqueueNDRangeKernel API.

I expected that a larger "local_work_size" would give faster processing.

But the processing speed actually saturates at

local_work_size = 8 to 16,

and gradually slows down above local_work_size = 20.

I find this strange because the A10-7850A has 512 stream processors.

What is wrong?

maxdz8 · ‎05-02-2014

There's no quick and easy answer to this question in general. Guessing is not the right thing to do.

Take a look at CodeXL if you haven't already.

Profile your application, then, in the profile results, click the "Kernel Occupancy" value of each call. This brings up various graphs that will help you better understand what is going on as the work size changes.
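As a rough illustration of why the work size matters at all, here is a minimal sketch of how many SIMD lanes one work-group keeps busy. It assumes 64 work-items per wavefront (the value on AMD GCN GPUs) and a maximum work-group size of 256; both should be checked against what clGetDeviceInfo reports for your device. The function names are hypothetical helpers, not part of OpenCL or CodeXL. This is only a lane-count model: it says nothing about memory bandwidth or register pressure, which is what the profiler is for.

```python
import math

# Assumption: 64 work-items per wavefront, as on AMD GCN GPUs.
WAVEFRONT_SIZE = 64


def lane_utilization(local_work_size, wavefront_size=WAVEFRONT_SIZE):
    """Fraction of SIMD lanes doing useful work for one work-group."""
    # A work-group is split into whole wavefronts; the last one may be partly empty.
    wavefronts = math.ceil(local_work_size / wavefront_size)
    return local_work_size / (wavefronts * wavefront_size)


def candidate_local_sizes(global_work_size, max_work_group_size=256):
    """Local sizes that divide the global size evenly, as OpenCL 1.x requires.

    max_work_group_size=256 is an assumed device limit; query the real one
    with CL_DEVICE_MAX_WORK_GROUP_SIZE.
    """
    return [n for n in range(1, max_work_group_size + 1)
            if global_work_size % n == 0]


if __name__ == "__main__":
    for size in (8, 16, 20, 64, 128, 256):
        print(size, lane_utilization(size))
```

Under these assumptions, a local_work_size of 8 or 16 fills only part of a wavefront, so multiples of 64 are the natural candidates to try first when sweeping the parameter.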

sudarshan · ‎05-07-2014

I agree with maxdz8.

obara · ‎05-07-2014

Thank you for the replies.

I investigated and found that the saturation is caused by memory access:

random memory access alone takes 300 ms per 100,000,000 accesses,

and clEnqueueNDRangeKernel takes 420 ms per 100,000,000 accesses.

In this case, the processing time depends not on local_work_size

but on memory access.
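The arithmetic behind those two figures can be sketched as follows. It assumes both timings cover the same 100,000,000 accesses and that the memory access time is contained in the kernel time; the function and variable names are only illustrative.

```python
def ns_per_access(total_ms, accesses):
    """Average cost of one access in nanoseconds."""
    return total_ms * 1e6 / accesses  # 1 ms = 1e6 ns


ACCESSES = 100_000_000

# Figures quoted in the post above.
memory_ns = ns_per_access(300, ACCESSES)   # random memory access only
kernel_ns = ns_per_access(420, ACCESSES)   # whole clEnqueueNDRangeKernel call

# Share of the kernel time spent on memory access (assumes the
# memory time is a component of the kernel time).
memory_share = 300 / 420

if __name__ == "__main__":
    print(memory_ns, kernel_ns, memory_share)
```

That works out to about 3 ns of a 4.2 ns budget per access, roughly 70% of the kernel time, which is why changing local_work_size has little room to help.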


How many work items A10-7850A's GPU runs the fastest