Archives Discussions

boxerab · ‎02-03-2015

I have a kernel with work group size equal to 32. Is it safe to remove all local memory barriers, since 32 is <= the size of a wavefront?

The following thread seems to imply that it is NOT good to remove local memory barriers, because work items may be merged:

How to query wavefront size from kernel?

Thanks!

tzachi_cohen · ‎02-03-2015

No, it is not safe to remove barriers as long as you have more than one work item per work group. In any case, make your work group size a multiple of 64, otherwise you are seriously under-utilizing the GPU.

boxerab · ‎02-04-2015

Thanks, Tzachi. Can you address the issue of work item merging? Is this why it is not safe to remove barriers for work groups with size less than the wavefront size?

Also, my work group size needs to be 32 due to the algorithm I am using.

realhet · ‎02-04-2015

Hi,

I just tried it out on GCN: when the work group size is 32, you get a whole wavefront for each work group, and half of that wavefront is disabled by the 64-bit exec mask.

When the work group fits in a single wavefront, there is no need for a local memory barrier.

Make sure you aren't using more than 16 KB of local memory, though (so that all 4 vector SIMDs in the compute unit can be utilized).
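
The arithmetic behind this post can be sketched in plain host-side C. This is only an illustration: the figures of 64 lanes per wavefront and 64 KB of LDS per compute unit are assumptions for GCN hardware, and the function names are hypothetical, not part of any OpenCL or AMD API.

```c
#include <limits.h>

/* Assumed GCN figures; other architectures differ. */
#define WAVEFRONT_SIZE   64u
#define LDS_BYTES_PER_CU (64u * 1024u)

/* Number of wavefronts needed to hold one work group (rounded up). */
unsigned wavefronts_per_group(unsigned wg_size)
{
    return (wg_size + WAVEFRONT_SIZE - 1u) / WAVEFRONT_SIZE;
}

/* Percentage of lanes that do useful work in the wavefronts of one
 * work group. Returns 0 for an empty work group. */
unsigned active_lane_percent(unsigned wg_size)
{
    unsigned waves = wavefronts_per_group(wg_size);
    if (waves == 0u)
        return 0u;
    return 100u * wg_size / (waves * WAVEFRONT_SIZE);
}

/* Upper bound on resident work groups per compute unit imposed by LDS
 * usage alone. Registers and other limits are ignored here. */
unsigned groups_per_cu_by_lds(unsigned lds_bytes_per_group)
{
    if (lds_bytes_per_group == 0u)
        return UINT_MAX;            /* LDS is not the limiting factor */
    return LDS_BYTES_PER_CU / lds_bytes_per_group;
}
```

Under these assumptions, `active_lane_percent(32)` gives 50 (half of each wavefront is masked off), and `groups_per_cu_by_lds(16 * 1024)` gives 4, which is one work group for each of the 4 SIMDs.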

boxerab · ‎02-04-2015

@realhet cool! I only use about 1K of local memory. So, you are saying that a work group size of 32 covers a whole wavefront. I was under the impression that the wavefront size is 64 on GCN.

So, I guess that if I target GCN, I can remove all of my barriers. Can someone from AMD confirm this?

realhet · ‎02-05-2015

Yes, it is exactly 64 on GCN. I remember that on very old Evergreen cards it was 32. Also, on recent NVIDIA hardware it is 32.

mrrvlad · ‎02-04-2015

In what case would I need to use a barrier (instead of mem_fence) when the work group size is <= 64, assuming a current AMD GCN GPU?

boxerab · ‎02-04-2015

Yes, good question. Can I get by with a memory fence instead of a barrier?

set · ‎02-04-2015

You always use a barrier where the algorithm needs one. But if you know that the work group size is fixed and fits in one wavefront, you can hint the compiler to optimize the barrier away by attributing your kernel with __attribute__((reqd_work_group_size(X, Y, Z)))
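
A minimal OpenCL C sketch of this suggestion, assuming a hypothetical kernel `sum32` that adds up 32 floats per work group. The barriers stay in the source, so the kernel remains correct on any device. Whether the compiler actually drops them for a work group that fits in one wavefront is up to the implementation and is not guaranteed by the OpenCL specification.

```c
/* Device code (OpenCL C). sum32 and its arguments are hypothetical. */
__kernel __attribute__((reqd_work_group_size(32, 1, 1)))
void sum32(__global const float *in, __global float *out)
{
    __local float scratch[32];
    const uint lid = get_local_id(0);

    /* Each work item stages one element in local memory. */
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction: 16, 8, 4, 2, 1 active work items. */
    for (uint stride = 16; stride > 0; stride >>= 1) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        /* Reached by every work item, as barrier() requires. */
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Work item 0 writes one partial sum per work group. */
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}
```

With this attribute, the host must enqueue the kernel with a local work size of exactly 32 in the first dimension; otherwise clEnqueueNDRangeKernel fails with CL_INVALID_WORK_GROUP_SIZE.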

jason · ‎02-05-2015

Based on what I've seen and used, it is safe. You may need to use mem_fence in some places to make sure things are flushed when you need LDS consistency. I noticed this trick because Bolt (AMD's sponsored/owned C++ STL-like library) uses it in its radix sort, as does Takahiro Harada's radix sort (he is one of the academics AMD looks to have collaborated with). I think they used it in their general scans too. You have to do some extra work to make sure you are fully using all compute units. I've only seen it give marginal gains in most uses, though it may have shaved a millisecond or so off a 3 ms operation on a 7970. I've not used it on GCN and wasn't sure how well this "trick" would work there.

Btw I use macros to cover the difference:

clcommons/common.h at master · nevion/clcommons · GitHub

wave synchronous programming