boxerab
Challenger

wave synchronous programming

I have a kernel with work group size equal to 32. Is it safe to remove all local memory barriers, since 32 is <= the size of a wavefront?

The following thread seems to imply that it is NOT good to remove local memory barriers, because work items may be merged: "How to query wavefront size from kernel?"

Thanks!

0 Likes
1 Solution

No, it is not safe to remove barriers as long as you have more than one work item per work group. In any case, make your work group size a multiple of 64, or you will seriously under-utilize the GPU.


0 Likes
9 Replies


Thanks, Tzachi. Can you address the issue of work item merging? Is this why it is not safe to remove barriers for work groups with size less than the wavefront size?

Also, my work group size needs to be 32 due to the algorithm I am using.

0 Likes

Hi,

I just tried it out on GCN: when the work group size is 32, each work group still gets a whole wavefront, so half of the wavefront's lanes are disabled by the 64-bit exec mask.

When the work group fits in a single wavefront, there is no need for a local memory barrier.

Make sure you aren't using more than 16 KB of local memory per work group, though, so that all 4 vector SIMDs in the compute unit can be utilized.
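The arithmetic behind the two numbers above can be sketched in plain host-side C. This is only an illustration, not kernel code: the helper names `exec_mask_for` and `groups_resident_by_lds` are made up here, and the figures (64 lanes per wavefront, 64 KB of LDS per compute unit, 4 SIMDs per compute unit) are assumed GCN values rather than anything queried from a device.

```c
#include <stdint.h>

/* Assumed GCN figures (not queried from hardware). */
enum {
    WAVEFRONT_SIZE   = 64,        /* lanes per wavefront */
    LDS_BYTES_PER_CU = 64 * 1024, /* local memory per compute unit */
    SIMDS_PER_CU     = 4          /* vector SIMDs per compute unit */
};

/* Hypothetical helper: bit i is set when lane i is active.
 * A work group of n <= 64 items occupies the low n lanes of one wavefront. */
static uint64_t exec_mask_for(unsigned work_group_size)
{
    if (work_group_size >= WAVEFRONT_SIZE)
        return ~UINT64_C(0);
    return (UINT64_C(1) << work_group_size) - 1;
}

/* Hypothetical helper: how many work groups can be resident on one
 * compute unit if local memory is the only limit. */
static unsigned groups_resident_by_lds(unsigned lds_bytes_per_group)
{
    if (lds_bytes_per_group == 0)
        return 0; /* no LDS use: this limit does not apply */
    return LDS_BYTES_PER_CU / lds_bytes_per_group;
}
```

Under these assumptions, `exec_mask_for(32)` is `0x00000000FFFFFFFF` (the upper 32 lanes idle), and `groups_resident_by_lds(16 * 1024)` is 4, which is one single-wavefront work group per SIMD. A work group using more than 16 KB would leave at least one SIMD without work.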

0 Likes

@realhet cool!  I only use about 1K of local memory.  So, you are saying that workgroup size of 32 covers a whole wavefront.

I was under the impression that wavefront size is 64 on GCN.

So, I guess that if I target GCN, I can remove all of my barriers.  Can someone from AMD confirm this ?

0 Likes

Yes, it is exactly 64 on GCN. I remember that on very old Evergreen cards it was 32. Also, on recent NVIDIA hardware it is 32.

0 Likes

In what case would I need to use a barrier (instead of mem_fence) when the work group size is <= 64, assuming a current GCN AMD GPU?

0 Likes

Yes, good question.  Can I get by with a memory fence instead of a barrier?

0 Likes

You always use a barrier where the algorithm needs it. But if you know that the work group size is fixed and fits in one wavefront, you can hint the compiler to optimize the barrier away by attributing your kernel with __attribute__((reqd_work_group_size(X, Y, Z))), for example reqd_work_group_size(32, 1, 1).
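As a minimal sketch of that suggestion, here is a kernel that keeps its barrier and declares a fixed work group size of 32, embedded in host-side C as a source string via a stringification macro (a common way to carry OpenCL C inside a C file). The kernel name `reverse32`, the macro `CL_SRC`, and the variable `reverse32_src` are invented for illustration. Whether a given compiler actually drops the barrier is up to the implementation; the source stays correct either way.

```c
/* Turns everything between the parentheses into a string literal.
 * Comments are stripped by the preprocessor and do not reach the string. */
#define CL_SRC(...) #__VA_ARGS__

/* Hypothetical kernel: reverses each block of 32 floats through local memory. */
static const char *const reverse32_src = CL_SRC(
    __kernel __attribute__((reqd_work_group_size(32, 1, 1)))
    void reverse32(__global const float *in, __global float *out)
    {
        __local float tile[32];
        const size_t lid = get_local_id(0);
        const size_t gid = get_global_id(0);

        tile[lid] = in[gid];

        /* Needed by the algorithm: the read below touches an element
         * written by a different work item. */
        barrier(CLK_LOCAL_MEM_FENCE);

        out[gid] = tile[31 - lid];
    }
);
```

The string would be passed to clCreateProgramWithSource and the kernel enqueued with a local size of exactly 32; with this attribute, any other local size is rejected at enqueue time.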

jason
Adept III

Based on what I've seen and used, it is safe. You may need to use mem_fence in some places to make sure writes are flushed when you need LDS consistency. I noticed this trick because Bolt (AMD's sponsored/owned C++ STL-like library) uses it in its radix sort, as do some academics AMD appears to have collaborated with (Takahiro Harada's radix sort); I think they used it in their general scans too. You have to do some extra work to make sure you fully use all compute units. I've only seen it give marginal gains in most uses, though it maybe shaved a millisecond or so off a 3 ms operation on a 7970. I've not used it on GCN and wasn't sure how well this "trick" would work there.

Btw I use macros to cover the difference:

clcommons/common.h at master · nevion/clcommons · GitHub
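One way such a macro could look, as a sketch only: this is not the content of the linked header, and `WG_SIZE`, `WAVEFRONT_SIZE`, and `LOCAL_SYNC` are hypothetical names. It assumes both sizes are known at build time (for example passed as -D options to clBuildProgram) and relies on the lockstep behaviour discussed in this thread, which the OpenCL specification itself does not promise.

```c
/* Defaults for illustration; normally supplied with -D at build time. */
#ifndef WAVEFRONT_SIZE
#define WAVEFRONT_SIZE 64 /* assumed GCN wavefront width */
#endif
#ifndef WG_SIZE
#define WG_SIZE 32 /* work group size used by the kernel */
#endif

#if WG_SIZE <= WAVEFRONT_SIZE
/* Whole work group runs as one wavefront: only order the memory operations. */
#define LOCAL_SYNC() mem_fence(CLK_LOCAL_MEM_FENCE)
#else
/* Several wavefronts per work group: a real barrier is required. */
#define LOCAL_SYNC() barrier(CLK_LOCAL_MEM_FENCE)
#endif
```

Kernel code then calls LOCAL_SYNC() wherever the algorithm needs local memory to be consistent, and the same source builds correctly for larger work groups or for devices with a narrower wavefront.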

0 Likes