AnsweredAssumed Answered

Persistent Threads and Wavefronts per CU

Question asked by settle on Oct 10, 2012
Latest reply on Dec 11, 2012 by settle

I created a sparse (dia format) linear solver (Biconjugate Gradient Stabilized Method) using persistent threads and profiled it using both the APP KernelAnalyzer and the APP Profiler.  The majority of the computational time is spent doing sparse matrix-vector multiplications, see kernel source code below.

__kernel void v_kernel(

    __global const double *A,

    __global const uint *I,

    __global const double *p,

    __global double *v)


    size_t i_begin;

    size_t i_end;

    size_t i_stride;


#if !defined(__CPU__) && !defined(__GPU__)

#error "Unsupported device type"

#elif defined(__CPU__)

    i_begin = (L + get_global_size(0) - 1) / get_global_size(0) * get_global_id(0);

    i_end = clamp((L + get_global_size(0) - 1) / get_global_size(0) * (get_global_id(0) + 1), (size_t) 0, (size_t) L);

    i_stride = 1;

#elif defined(__GPU__)

    i_begin = get_global_id(0);

    i_end = L;

    i_stride = get_global_size(0);



    for (size_t i = i_begin; i < i_end; i += i_stride)


        double A_p = 0;

        for (size_t j = 0; j < P; ++j)

            A_p += A[i * LDA + j] * p[I[i * LDI + j]];

        v[i] = A_p;



If L = 129^3 and P = LDA = LDI = 27 are defined at compile time, then for my Tahiti GPU--Radeon HD 7950--I get the following times:

global_size (local_size = 64)Average Time (ms)
1792 (i.e., 1 wavefront / compute_unit)6.05359
3584 (i.e., 2 wavefronts / compute_unit)6.65148
7168 (i.e., 4 wavefronts / compute_unit)9.07413
14336 (i.e., 8 wavefronts / compute_unit)10.13020

Since the KernelAnalyzer tells me that the bottleneck is due to global fetching, I don't understand the above results.  The APP Programming Guide recommends to use as many wavefronts as possible (less than the limit, 40 for Tahiti, but in this case 8 due to VGPRs usage) to hide memory latencies.  The other kernels (daxpy types and reduction types) generally increase in speed as I use more wavefronts.


Would someone mind explaining what's going on and how I could increase performance?  Thanks!