

obara
Journeyman III

With how many work-items does the A10-7850A's GPU run fastest?

To speed up my calculation, I tried changing the "local_work_size" parameter

of the clEnqueueNDRangeKernel API.

I expected that the larger "local_work_size" is, the faster the processing would be.

But in practice the processing speed saturates at

  local_work_size = 8 to 16,

and gradually slows down above local_work_size = 20.

I find this strange because the A10-7850A has 512 stream processors.

What is wrong?

0 Likes
3 Replies
maxdz8
Elite

There's no quick and easy answer to this question in general. Guessing is not the right thing to do.

Take a look at CodeXL if you haven't already.

Profile your application, then in the profile results, click on the "Kernel Occupancy" value of each call. You will be presented with various graphs that help you better understand what's going on as the work size changes.
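
As a rough illustration of why occupancy can change with the work size, here is a minimal C sketch. It is not a substitute for profiling. It assumes a wavefront of 64 work-items, as on AMD GCN GPUs, and that a wavefront never spans two work-groups. `WAVEFRONT_SIZE` and both function names are hypothetical helpers made up for this example, not part of OpenCL or CodeXL.

```c
#include <assert.h>

/* Assumption: 64 work-items per wavefront, as on AMD GCN GPUs. */
#define WAVEFRONT_SIZE 64u

/* Number of wavefronts needed to hold one work-group of local_size items. */
static unsigned wavefronts_per_group(unsigned local_size)
{
    assert(local_size > 0);
    /* Round up: a partly filled wavefront still occupies a whole one. */
    return (local_size + WAVEFRONT_SIZE - 1u) / WAVEFRONT_SIZE;
}

/* Percentage of wavefront lanes that carry a real work-item. */
static double lane_utilization_percent(unsigned local_size)
{
    unsigned lanes = wavefronts_per_group(local_size) * WAVEFRONT_SIZE;
    return 100.0 * (double)local_size / (double)lanes;
}
```

Under these assumptions, `lane_utilization_percent(16)` is 25.0 and `lane_utilization_percent(64)` is 100.0, so a local_work_size of 8 to 16 would leave most lanes of each wavefront idle. Whether that is the limiting factor in a given kernel is something only the profiler can confirm.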

0 Likes

I agree with maxdz8.

0 Likes

Thank you for the reply.

I investigated and found that the saturation is caused by memory access:

random memory access takes 300 ms per 100,000,000 accesses,

and clEnqueueNDRangeKernel takes 420 ms per 100,000,000 accesses.

So in this case, the processing time depends not on local_work_size

but on memory access.
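
The two measurements above can be put on a per-access basis with a small C sketch. The function names are hypothetical helpers for this example only. The inputs are the figures quoted above, and the sketch assumes both were measured over the same 100,000,000 accesses.

```c
#include <assert.h>

/* Convert a total time in milliseconds into nanoseconds per access. */
static double ns_per_access(double total_ms, double accesses)
{
    assert(accesses > 0.0);
    return total_ms * 1.0e6 / accesses;  /* 1 ms = 1e6 ns */
}

/* Share of the whole measurement taken up by one part, in percent. */
static double share_percent(double part_ms, double whole_ms)
{
    assert(whole_ms > 0.0);
    return 100.0 * part_ms / whole_ms;
}
```

With these figures, `ns_per_access(300.0, 1.0e8)` is 3 ns per access for the random memory access alone, `ns_per_access(420.0, 1.0e8)` is about 4.2 ns per access for the whole kernel call, and `share_percent(300.0, 420.0)` is roughly 71%. If the measurements are comparable, that leaves only about 30% of the time that a different local_work_size could affect.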

0 Likes