
settle
Challenger

Persistent Threads and Wavefronts per CU

I created a sparse (DIA format) linear solver (Biconjugate Gradient Stabilized Method) using persistent threads and profiled it with both the APP KernelAnalyzer and the APP Profiler.  The majority of the computational time is spent in sparse matrix-vector multiplications; see the kernel source code below.

__kernel void v_kernel(
    __global const double *A,
    __global const uint *I,
    __global const double *p,
    __global double *v)
{
    size_t i_begin;
    size_t i_end;
    size_t i_stride;
#if !defined(__CPU__) && !defined(__GPU__)
#error "Unsupported device type"
#elif defined(__CPU__)
    /* CPU path: each work item handles a contiguous block of rows */
    i_begin = (L + get_global_size(0) - 1) / get_global_size(0) * get_global_id(0);
    i_end = clamp((L + get_global_size(0) - 1) / get_global_size(0) * (get_global_id(0) + 1), (size_t) 0, (size_t) L);
    i_stride = 1;
#elif defined(__GPU__)
    /* GPU path: persistent threads stride through the rows */
    i_begin = get_global_id(0);
    i_end = L;
    i_stride = get_global_size(0);
#endif
    for (size_t i = i_begin; i < i_end; i += i_stride)
    {
        double A_p = 0;
        for (size_t j = 0; j < P; ++j)
            A_p += A[i * LDA + j] * p[I[i * LDI + j]];
        v[i] = A_p;
    }
}

With L = 129^3 and P = LDA = LDI = 27 defined at compile time, I get the following times on my Tahiti GPU (Radeon HD 7950):

global_size (local_size = 64)                  Average time (ms)
 1792 (i.e., 1 wavefront per compute unit)         6.05359
 3584 (i.e., 2 wavefronts per compute unit)        6.65148
 7168 (i.e., 4 wavefronts per compute unit)        9.07413
14336 (i.e., 8 wavefronts per compute unit)       10.13020
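
For reference, the host side sizes the persistent-thread launch roughly as shown below (a sketch only; device, program, queue, and kernel are set up elsewhere, and the variable names are just for illustration):

/* Sketch: persistent-thread launch sizing.  64 work items per work-group
   (one wavefront on GCN); global size = #CUs * wavefronts per CU * 64. */
cl_uint num_cus;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(num_cus), &num_cus, NULL);      /* 28 on a HD 7950 */

size_t local_size   = 64;
size_t waves_per_cu = 1;                               /* 1, 2, 4, or 8 in the table above */
size_t global_size  = num_cus * waves_per_cu * local_size;

/* The compile-time constants are passed as build options (L = 129^3 = 2146689);
   __CPU__ / __GPU__ are assumed to be predefined by the AMD compiler per device
   type, as the #error branch in the kernel expects. */
clBuildProgram(program, 1, &device, "-DL=2146689 -DP=27 -DLDA=27 -DLDI=27",
               NULL, NULL);

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);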

The KernelAnalyzer tells me the bottleneck is global fetching, so I don't understand the above results.  The APP Programming Guide recommends using as many wavefronts as possible (up to the limit, which is 40 per compute unit on Tahiti but only 8 in this case due to VGPR usage) to hide memory latency.  The other kernels (daxpy types and reduction types) generally get faster as I use more wavefronts.

Would someone mind explaining what's going on and how I could increase performance?  Thanks!

notzed
Challenger

Read the section in the programming guide about coalesced memory access.  Ideally you want the memory accesses coalesced: each work item accesses adjacent memory locations.  Obviously you can't do this for a sparse array lookup, but the way A[] and I[] are accessed is nearly 'worst case', which is why global fetching is the bottleneck.

Worst-case global memory access performance from a poor access pattern is many times lower than the maximum, and you can't get enough threads to hide that much latency with the small amount of work being done.  Adding threads is probably just overloading the tiny bit of cache you have on each CU, making it even worse.

The fix is easy - basically just transpose them in memory, so that the global work id is used as the column index, where the storage is row-major order.  The sparse access will still be sparse, but that too could be organised in a way that helps.

i.e., where possible you want work item 1 accessing the memory address right after the one work item 0 is accessing, at the same time.
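
Concretely, with A[] and I[] stored transposed (leading dimension L instead of LDA/LDI), the inner loop would look something like this sketch:

    /* Transposed layout: entry j of row i lives at A[j * L + i], so for a fixed j
       consecutive work items (consecutive i) read consecutive addresses, i.e. coalesced. */
    for (size_t i = i_begin; i < i_end; i += i_stride)
    {
        double A_p = 0;
        for (size_t j = 0; j < P; ++j)
            A_p += A[j * L + i] * p[I[j * L + i]];
        v[i] = A_p;
    }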

Remember that all work items in the local work-group are not 'independent threads' as in a CPU - they all execute the same instructions in lock-step.

PS The same rationale applies to a CPU implementation, and the code you have is written that way.


Hi notzed,

Thanks for the reply!  I too thought the access pattern of A[] and I[] was causing the bottleneck, so as you suggested I transposed those matrices, e.g., A[j * LDA + i] but with LDA = L, and likewise for I[].  The benchmark showed no significant difference, so I concluded that the bottleneck was the sparse read from accessing p[].  However, in my case I[j * LDI + i + 1] is equal to either I[j * LDI + i] or I[j * LDI + i] + 1, so consecutive work items read the same or adjacent entries of p[] and those accesses should coalesce; maybe the caching system (if it even uses one) can't recognize that?

Any more ideas?  Also, similar to the loop in the CPU implementation above, should work-groups on the GPU be spaced as far apart as possible to improve cache performance (i.e., to reduce false cache dependencies)?  I know, as you described, that work-items on the GPU should perform contiguous memory accesses, but I'm not so sure about work-groups.

Thanks again


For GPUs, work-groups spaced far apart may run into global bank/channel conflicts.  What I don't know is whether that's worse than false cache dependencies, assuming a cache is used at all.

Also, I thought it might help if I gave a simple example of A[] and I[].  Suppose P = 5 and L = 16 (a two-dimensional 4x4 mesh using a 5-point FDM scheme with the boundary terms included in A[]).  Then A[] is

    0      0      1      0      0
    0      0      1      0      0
    0      0      1      0      0
    0      0      1      0      0
    0      0      1      0      0
   -0.25  -0.25   1     -0.25  -0.25
   -0.25  -0.25   1     -0.25  -0.25
    0      0      1      0      0
    0      0      1      0      0
   -0.25  -0.25   1     -0.25  -0.25
   -0.25  -0.25   1     -0.25  -0.25
    0      0      1      0      0
    0      0      1      0      0
    0      0      1      0      0
    0      0      1      0      0
    0      0      1      0      0

and I[] is

    0  0  0  1  4
    0  0  1  2  5
    0  1  2  3  6
    0  2  3  4  7
    0  3  4  5  8
    1  4  5  6  9
    2  5  6  7 10
    3  6  7  8 11
    4  7  8  9 12
    5  8  9 10 13
    6  9 10 11 14
    7 10 11 12 15
    8 11 12 13 15
    9 12 13 14 15
   10 13 14 15 15
   11 14 15 15 15

Some of the indices above had to be clamped to [0, L-1] to prevent out-of-bounds accesses: the repeated 0's at the start of the first few rows and the repeated 15's at the end of the last few rows.  An alternative to clamping would be to pad A[] with extra rows of zeros and then use a nonzero offset.
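
For reference, the host code that builds these looks roughly like the sketch below (build_5pt and n are just illustrative names, and this is only the 2-D 5-point example; the actual solver uses the 3-D 27-point version):

/* Sketch: build the 5-point example above for an n x n mesh (L = n*n, P = 5).
   Indices are clamped to [0, L-1]; interior points get the -0.25 / 1 stencil,
   boundary points just get an identity row. */
void build_5pt(int n, double *A, unsigned int *I)
{
    const int L = n * n, P = 5;
    const int offset[5] = { -n, -1, 0, 1, n };
    for (int i = 0; i < L; ++i)
    {
        const int x = i % n, y = i / n;
        const int interior = (x > 0 && x < n - 1 && y > 0 && y < n - 1);
        for (int j = 0; j < P; ++j)
        {
            int k = i + offset[j];
            k = k < 0 ? 0 : (k > L - 1 ? L - 1 : k);   /* clamp */
            I[i * P + j] = (unsigned int) k;
            A[i * P + j] = (offset[j] == 0) ? 1.0 : (interior ? -0.25 : 0.0);
        }
    }
}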


Hmm.

Out of curiosity I took that code and your description of I[] and created a test example.  Transposed is a bit faster, and it doesn't get slower with more wavefronts: at 8 wavefronts/CU it's twice as fast as non-transposed (but only about the same as at 1).

It's using a lot of registers for such a small kernel, 120 odd.  I guess it's unrolled the whole P loop.  That might limit the concurrency at high wavefront counts.

I tried changing P to be a parameter, and that got the register use way down.  However, the code runs a lot slower (roughly 10x?), although the transposed case does get faster as you ramp up the number of wavefronts, and by 8 it's about the same.

If I[] can be calculated on the fly, that's about the only other thing I can think of.

I doubt it's global bank conflicts; I tried adding 1 to the stride, for example, but to be honest I've not looked into that too much since it hasn't been an issue for me.

A simpler loop that only calculates A_p += A[j*L+i] still takes nearly half as long, so it's probably just bandwidth limited: you're processing a very big array with few ALU ops, and there's basically no caching to be had.
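
A rough back-of-the-envelope supports that (assuming the 7950/7970 peak of roughly 240-264 GB/s): with L = 129^3 and P = 27, A[] alone is L * 27 * 8 bytes, about 464 MB, and I[] another roughly 232 MB, so even if p[] and v[] were free the kernel has to move around 700 MB per call.  Doing that in about 6 ms is already roughly 115 GB/s, so there isn't a huge factor left to win just by adding wavefronts.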

(I'm using no valid data and only initialise I[], but I get around 5 ms on a 7970, so I presume I'm basically running the same thing.)

notzed wrote:

transposed is a bit faster, and it doesn't get slower with more wavefronts: at 8 wavefronts/CU it's twice as fast as non-transposed (but only about the same as at 1)

When I transposed it I stopped at 1 wavefront/CU because, as you saw too, they're about the same there.  I couldn't see why they'd be the same if this was the bottleneck and transposing provides more efficient memory access.  Anyway, I should have kept increasing the wavefronts; thanks for pointing that out.

I added a #pragma unroll 1 to try to stop the compiler from unrolling the entire P loop, but the number of VGPRs used is approximately the same (still using 1 wavefront/CU).  I can calculate I[] on the fly by changing __global uint *I to __constant int *I, in which case the inner loop becomes A_p += A[j * LDA + i] * p[clamp(i + I[j], 0, L-1)].  My concern has been whether this would slow down the CPU implementation.  I'll run some more tests, and if it turns out both are faster I'll switch to that indexing method.  I'll post the results.
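
For reference, the on-the-fly variant I have in mind is roughly the sketch below (GPU path only; offset holds the P stencil offsets in __constant memory, and the long casts are only there to keep the signed arithmetic in clamp well defined):

__kernel void v_kernel_otf(
    __global const double *A,
    __constant int *offset,      /* the P stencil offsets */
    __global const double *p,
    __global double *v)
{
    for (size_t i = get_global_id(0); i < L; i += get_global_size(0))
    {
        double A_p = 0;
        for (size_t j = 0; j < P; ++j)
            A_p += A[j * LDA + i]    /* transposed layout, LDA defined as L */
                 * p[clamp((long) i + offset[j], (long) 0, (long) (L - 1))];
        v[i] = A_p;
    }
}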

By the way, do you have any experience using the async_work_group_copy functions or the prefetch function?  I'm wondering if I could get any speedup from those if the hardware cache can't capture the access pattern.


I tried the pragma thing too but it just seemed to be ignored.

Lookup tables are slow on modern CPUs, particularly if they can't be cached; you can do an awful lot of ALU work in the time it takes a cache miss to reach system memory.  So it might be faster doing the arithmetic, and if not you could always just create two versions.

I haven't used async_work_group_copy because I normally just do the address arithmetic by hand to achieve the same result.  To me it always looked like an API added to interface with the DMA channel hardware on the Cell BE, which was one of the early OpenCL targets (and without it you wouldn't be able to get decent performance from one), and as far as I know GPUs just implement it as a copy loop anyway (it might save some address calculations, though).
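
If you do want to try it, the usage pattern is something like the sketch below (TILE and tile_base are made-up names; whether staging p[] through local memory helps at all depends on how much the work-group actually reuses the tile):

    /* Sketch: stage a tile of p[] into local memory for the whole work-group.
       Every work item in the group must reach these calls with the same arguments. */
    __local double p_tile[TILE];
    event_t e = async_work_group_copy(p_tile, p + tile_base, (size_t) TILE, (event_t) 0);
    wait_group_events(1, &e);
    /* ... then read p_tile[k - tile_base] instead of p[k] for k in [tile_base, tile_base + TILE) ... */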


Sean Settle wrote:

I added a #pragma unroll 1 to try to stop the compiler from unrolling the entire P loop, but the number of VGPRs used is approximately the same (still using 1 wavefront/CU).  I can calculate I[] on the fly by changing __global uint *I to __constant int *I, in which case the inner loop becomes A_p += A[j * LDA + i] * p[clamp(i + I[j], 0, L-1)].  My concern has been whether this would slow down the CPU implementation.  I'll run some more tests, and if it turns out both are faster I'll switch to that indexing method.  I'll post the results.

Sorry for following up so late with this, but I finally got around to comparing reading the array indices from a look-up table against computing them on the fly.  My results showed no significant difference for CPUs, but the on-the-fly method did improve the overall speed on GPUs for small global sizes.  For large global sizes there was only a slight improvement on the GPUs.  Probably the biggest benefit of the on-the-fly method was the reduced memory footprint.