I created a sparse (DIA format) linear solver (Biconjugate Gradient Stabilized Method) using persistent threads and profiled it with both the AMD APP KernelAnalyzer and the AMD APP Profiler. The majority of the computational time is spent in the sparse matrix-vector multiplication; see the kernel source below.
__kernel void v_kernel(
    __global const double *A,
    __global const uint *I,
    __global const double *p,
    __global double *v)
{
    size_t i_begin, i_end, i_stride;
#if !defined(__CPU__) && !defined(__GPU__)
#error "Unsupported device type"
#elif defined(__CPU__)
    /* One contiguous chunk of rows per work-item. */
    i_begin = (L + get_global_size(0) - 1) / get_global_size(0) * get_global_id(0);
    i_end = clamp((L + get_global_size(0) - 1) / get_global_size(0) * (get_global_id(0) + 1), (size_t) 0, (size_t) L);
    i_stride = 1;
#else /* __GPU__ */
    /* Rows interleaved across work-items, strided by the global size. */
    i_begin = get_global_id(0);
    i_end = L;
    i_stride = get_global_size(0);
#endif
    for (size_t i = i_begin; i < i_end; i += i_stride)
    {
        double A_p = 0;
        for (size_t j = 0; j < P; ++j)
            A_p += A[i * LDA + j] * p[I[i * LDI + j]];
        v[i] = A_p;
    }
}
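For clarity, here is a quick host-side sanity check (written in Python, not part of the solver) of the two persistent-thread partitioning schemes in the kernel above: the `__CPU__` branch gives each work-item one contiguous chunk of rows, while the `__GPU__` branch interleaves rows across work-items. Both should visit every row index in [0, L) exactly once.

```python
def cpu_rows(L, gsize, gid):
    # __CPU__ branch: contiguous chunk of ceil(L / gsize) rows, stride 1
    chunk = (L + gsize - 1) // gsize
    i_begin = chunk * gid
    i_end = min(chunk * (gid + 1), L)  # the clamp() in the kernel
    return range(i_begin, i_end)

def gpu_rows(L, gsize, gid):
    # __GPU__ branch: rows gid, gid + gsize, gid + 2*gsize, ...
    return range(gid, L, gsize)

L, gsize = 129**3, 1792
for scheme in (cpu_rows, gpu_rows):
    visited = sorted(i for gid in range(gsize) for i in scheme(L, gsize, gid))
    assert visited == list(range(L))  # full coverage, no row done twice
```

Note that the two branches differ in their memory access pattern: in the `__GPU__` branch, neighboring work-items of a wavefront touch rows i and i + 1 in the same iteration, so their accesses to A and I are separated by LDA (resp. LDI) elements rather than being unit-stride.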
If L = 129^3 and P = LDA = LDI = 27 are defined at compile time, then on my Tahiti GPU (Radeon HD 7950) I get the following times:
| global_size (local_size = 64) | Average Time (ms) |
| --- | --- |
| 1792 (i.e., 1 wavefront / compute unit) | 6.05359 |
| 3584 (i.e., 2 wavefronts / compute unit) | 6.65148 |
| 7168 (i.e., 4 wavefronts / compute unit) | 9.07413 |
| 14336 (i.e., 8 wavefronts / compute unit) | 10.13020 |
Since the KernelAnalyzer tells me the bottleneck is global fetching, I don't understand the results above. The AMD APP Programming Guide recommends using as many wavefronts as possible (up to the hardware limit, 40 per compute unit on Tahiti, though here only 8 because of VGPR usage) to hide memory latency. My other kernels (daxpy-type and reduction-type) generally get faster as I use more wavefronts.
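For what it's worth, here is my rough estimate of the bandwidth the fastest run achieves, assuming every element of A and I and every gathered element of p is fetched from DRAM exactly once (i.e., no cache reuse of p), and taking the stock HD 7950 peak memory bandwidth of roughly 240 GB/s:

```python
L = 129**3
P = 27
# per row: P doubles of A, P uints of I, P gathered doubles of p, 1 double stored to v
bytes_per_row = P * 8 + P * 4 + P * 8 + 8  # 548 bytes
total_bytes = L * bytes_per_row            # ~1.18 GB moved per SpMV
achieved = total_bytes / 6.05359e-3        # bytes/s at the 1-wavefront timing
print(f"{achieved / 1e9:.0f} GB/s of ~240 GB/s peak")
```

Under those (admittedly pessimistic) assumptions the single-wavefront configuration would already be near the memory bandwidth limit, which may be why adding wavefronts doesn't help here the way it does for my daxpy and reduction kernels.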
Would someone mind explaining what's going on and how I could increase performance? Thanks!