
cocular
Journeyman III

The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

Hello!

  Recently I have been trying to implement a priority queue in OpenCL, and I have some doubts about barrier and mem_fence.  Here is my understanding.

  1. barrier(CLK_GLOBAL_MEM_FENCE):
    1. It makes sure that all the work-items in the same work-group reach this barrier.
    2. It makes sure that all writes to global memory by the current work-item can be read correctly by the other work-items in the same work-group after the barrier (see the sketch below).
  2. mem_fence:
    1. It makes sure that all writes by the current work-item can be correctly read by this same work-item after the fence.

Am I missing something?  Is this right?
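For example, here is how I picture 1.2 (just a sketch of my understanding, not code from the guide):

__kernel void publish(__global int *flags)
{
    // One work-item per work-group writes to global memory.
    if (get_local_id(0) == 0)
        flags[get_group_id(0)] = 42;

    // Rendezvous plus flush: after this barrier, every work-item in the
    // SAME work-group must see the write.  Work-items in other
    // work-groups get no such guarantee.
    barrier(CLK_GLOBAL_MEM_FENCE);

    int v = flags[get_group_id(0)];   // guaranteed to read 42 here
}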

Now for the performance question.  I read the AMD Accelerated Parallel Processing OpenCL Programming Guide.  On page 136 there is a cache-hierarchy figure, and I see that there is an L1 cache per compute unit.  Since one work-group can only run on one compute unit, does the GPU, at a global barrier, not need to wait 400+ cycles for the write to reach global memory, but only until the write to L1 completes?

If that is the case, what is the latency when L1 hits?  Or when L2 hits?

realhet
Miniboss

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

Hi,

I think you understand it well, but on 2: I think you meant to write "It makes sure that (all) the writes in the current work-item can be correctly read by this work-group(!!!) after the fence".

Not sure, but if you're looking for global synchronization, you can use memory atomic operations.
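For example (a rough sketch, the kernel name is mine): per-work-item contributions can be combined through a global atomic with no cross-work-group barrier at all, because the order of the atomic adds does not matter. This needs 32-bit global atomics (core since OpenCL 1.1):

__kernel void reduce_sum(__global const int *input, __global int *result)
{
    // *result must be zero-initialized by the host beforehand.
    atomic_add(result, input[get_global_id(0)]);
}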


For memory performance: in the APP OpenCL guide there is an Optimization section (chapters 5 and 6).

At "5.4 OpenCL Memory Resources: Capacity and Performance" you can find the various bandwidths for the HD7970 GCN chip.

For the VLIW HD5870, see "6.4 OpenCL Memory Resources: Capacity and Performance".

himanshu_gautam
Grandmaster

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

You are right about barriers and mem_fence.

From a performance perspective, your answers also lie in the programming guide itself:

The L1 cache can service up to four address requests per cycle, each delivering up to 16 bytes. The bandwidth shown assumes an access size of 16 bytes; smaller access sizes/requests result in a lower peak bandwidth for the L1 cache. Using float4 with images increases the request size and can deliver higher L1 cache bandwidth.

Each memory channel on the GPU contains an L2 cache that can deliver up to 64 bytes/cycle. The AMD Radeon™ HD 7970 GPU has 12 memory channels; thus, it can deliver up to 768 bytes/cycle; divided among 2048 stream cores, this provides up to ~0.4 bytes/cycle for each stream core.
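For example, a float4 image read issues one 16-byte request per work-item, which is exactly the access size the peak L1 figure above assumes. A rough sketch (the kernel and names are mine, not from the guide):

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void copy_img(__read_only image2d_t src, __global float4 *dst, int width)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    // read_imagef returns a float4, i.e. a single 16-byte L1 request.
    dst[pos.y * width + pos.x] = read_imagef(src, smp, pos);
}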

cocular
Journeyman III

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

For 2, I really did mean to write

... by this work-item after the fence.

What I am not clear about is whether the same guarantee holds for the current work-group.

cocular
Journeyman III

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

I think I did not state my question clearly.  Consider the following code:


int data;

global_mem[global_id] = data;
barrier(CLK_GLOBAL_MEM_FENCE);


So the write will go into L1.  Since the barrier only has to synchronize within the current work-group, it should cost very little (the latency of L1), not 400+ cycles, shouldn't it?

Also, what does "per cycle" above mean: latency or bandwidth?  Each device has a table of bandwidths for its memory.  Is there a table of latencies for memory operations?  Or of bandwidth and latency for arithmetic operations?

himanshu_gautam
Grandmaster

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

I did get your point. I am not sure whether latencies are mentioned anywhere in the guide; I will ask about them.

I also cannot answer the question about the latency of the above code snippet with confidence.
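Meanwhile, you can always measure kernel time yourself with OpenCL profiling events. A minimal host-side sketch (error checking omitted; ctx, dev, kernel, gsize and lsize are assumed to be set up already):

#include <stdio.h>
#include <CL/cl.h>

/* Create the command queue with profiling enabled. */
cl_int err;
cl_command_queue q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);

cl_event evt;
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gsize, &lsize, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong t0, t1;   /* device timestamps in nanoseconds */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
printf("kernel took %lu ns\n", (unsigned long)(t1 - t0));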

cocular
Journeyman III

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

I have done some benchmarking and found some unexpected results.

My benchmark code copies 256*repeat_time values to GPU memory.  In my test, repeat_time = 4096.  I add __kernel __attribute__((reqd_work_group_size(4, 1, 1))) to tell OpenCL that I will run it with less than one wavefront, in order to test latency.

Code 1: uses a local barrier.

Code 2: uses a global barrier.

Code 3: no barrier.

Results on an HD6770 (mobile):

Code 1: 64 ms

Code 2: 500 ms

Code 3: less than 1 ms

Now there are several problems:

1.  I do not use local memory, and there is only ONE wavefront (I set reqd_work_group_size).  Why does the local barrier significantly decrease performance?  It seems to be pure instruction overhead: just 256*4096 local mem_fences still need that much time.

2.  The global barrier is too slow!  The latency is about 500 cycles!  Obviously barrier() is naive here; the L1 cache and the single wavefront are not exploited.  Or does L1 have a 500-cycle latency?  OK, I was wrong: global operations are not cached, so this is normal.

How can I optimize the global writes while still synchronizing the group?  One way is to manually maintain a cache in local memory and only use a local barrier (a rough sketch of what I mean is after Code 3), but I think that is too hard to implement.  Any advice?

Code1:


__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void memtest(__global int *global_mem, int repeat_time)
{
    int data = get_global_id(0);

    for (int i = 0; i < repeat_time; ++i) {
        global_mem[data + 1  ] = data;
        barrier(CLK_LOCAL_MEM_FENCE);

        global_mem[data + 2  ] = data;
        barrier(CLK_LOCAL_MEM_FENCE);

        global_mem[data + 3  ] = data;
        barrier(CLK_LOCAL_MEM_FENCE);

        ...

        global_mem[data + 255] = data;
        barrier(CLK_LOCAL_MEM_FENCE);

        global_mem[data + 256] = data;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

Code2:


__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void memtest(__global int *global_mem, int repeat_time)
{
    int data = get_global_id(0);

    for (int i = 0; i < repeat_time; ++i) {
        global_mem[data + 1  ] = data;
        barrier(CLK_GLOBAL_MEM_FENCE);

        global_mem[data + 2  ] = data;
        barrier(CLK_GLOBAL_MEM_FENCE);

        global_mem[data + 3  ] = data;
        barrier(CLK_GLOBAL_MEM_FENCE);

        ...

        global_mem[data + 255] = data;
        barrier(CLK_GLOBAL_MEM_FENCE);

        global_mem[data + 256] = data;
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

Code3:


__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void memtest(__global int *global_mem, int repeat_time)
{
    int data = get_global_id(0);

    for (int i = 0; i < repeat_time; ++i) {
        global_mem[data + 1  ] = data;


        global_mem[data + 2  ] = data;


        global_mem[data + 3  ] = data;


        ...

        global_mem[data + 255] = data;


        global_mem[data + 256] = data;

    }
}
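For reference, this is roughly what I mean by manually maintaining a cache in local memory (only a sketch: the host must allocate the __local buffer with clSetKernelArg(kernel, 1, 260 * sizeof(cl_int), NULL), and the data reaches global memory only once at the end):

__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void memtest_staged(__global int *global_mem, __local int *staging, int repeat_time)
{
    int data = get_global_id(0);

    for (int i = 0; i < repeat_time; ++i) {
        for (int k = 1; k <= 256; ++k) {
            staging[data + k] = data;
            barrier(CLK_LOCAL_MEM_FENCE);   // cheap: only local memory is involved
        }
    }

    // The last barrier above makes all staged writes visible; now flush
    // the staged values (indices 1..259) to global memory once.
    for (int j = 1 + get_local_id(0); j < 260; j += get_local_size(0))
        global_mem[j] = staging[j];
}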

himanshu_gautam
Grandmaster

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

You are setting the required work-group size to just 4 work-items, and different work-groups clearly write to the same memory locations, hence I would assume the global writes must be carried all the way to global memory. For example, global_id=0 modifies elements 1-256, and global_id=4 (from the 2nd work-group) modifies elements 5-260. Probably create a small section of global memory dedicated to each work-group; it may give good performance.
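For example (a sketch; REGION_INTS is a size I made up, it just has to be large enough that the slices do not overlap):

#define REGION_INTS 512   /* assumed per-work-group slice size */

__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void memtest_regions(__global int *global_mem, int repeat_time)
{
    // Each work-group writes only inside its own disjoint slice.
    __global int *region = global_mem + get_group_id(0) * REGION_INTS;
    int data = get_local_id(0);

    for (int i = 0; i < repeat_time; ++i) {
        for (int k = 1; k <= 256; ++k) {
            region[data + k] = data;
            barrier(CLK_GLOBAL_MEM_FENCE);
        }
    }
}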

I would expect Code 3 to get optimized away to a large degree. There are large write conflicts, so it is hard to say what it will get compiled to.

EDIT: Also, latencies are not an important measure to know, as they are always hidden. It should just be understood that caches have much lower latency than global memory. Do you have any idea about cache latencies from NVIDIA?

cocular
Journeyman III

Re: The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

For the work-group problem: there is only one work-group in this benchmark, precisely to avoid that problem.

The key point is that the global memory buffer does NOT have a cache; L1 and L2 are only for texture memory.  So the large latency is somewhat reasonable here.

0 Likes
himanshu_gautam
Grandmaster

Re: The definition about clk_global_mem_fence and mem_fence and their effect upon performance



"The key point is that the global memory buffer does NOT have a cache; L1 and L2 are only for texture memory.  So the large latency is somewhat reasonable here."


This is not true. Caching is used for global memory too. Check the globalMemoryBandwidth sample, which shows close to 1000 GBps when caching is present. It is quite tricky there to avoid caching in order to demonstrate the actual global memory bandwidth of a device.
