ginquo
Adept II

Global locking in OpenCL kernel

Hello,

I'm trying to implement locking of global buffers so I can apply a blit operation to a tile (which may be blitted to by other work-groups at the same time).

My current implementation looks like this.


// kernel arguments:
// volatile global float4* color_buffer ... color framebuffer, organized in tiles
// volatile global float* depth_buffer .... pixel depths associated with framebuffer
// volatile global int* tile_locks ........ a lock value for each tile of the framebuffer (1 if unlocked)

// local/private vars:
// local float4 colors[8][8] ... color tile that is going to be blitted
// local float depths[8][8] .... depth values of the associated tile

// private vars:
// int2 l ........ local id of work item in range [0,7]x[0,7]
// int head ...... 1 if l == (0,0) otherwise 0
// int tile_id ... index of the tile that is blitted to
// int fb_id ..... index of the pixel of the tile that is written to

// blit tile
if (head) while (!atomic_xchg(&(tile_locks[tile_id]), 0));
barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

if (depths[l.y][l.x] < depth_buffer[fb_id]) {
    depth_buffer[fb_id] = depths[l.y][l.x];
    color_buffer[fb_id] = colors[l.y][l.x];
}

barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
if (head) atomic_xchg(&(tile_locks[tile_id]), 1);


The idea is to have the "main" work item acquire a lock for the tile, while all the others are waiting for it, apply the blit operation and then have the main item unlock it again. All using atomic operations.

However, this does not seem to work. Individual work groups are writing over each other as if no locking at all takes place. Are there any obvious errors I'm making here? Is global locking possible in OpenCL?

I'm using a Radeon HD 7970 with the Catalyst 13.8 beta on Linux.
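
For readers following the lock protocol in the post above (1 means unlocked, exchanging in 0 acquires the lock), here is a minimal host-side sketch in plain C11. It is not OpenCL kernel code and not the poster's implementation: `tile_try_lock`, `tile_lock`, and `tile_unlock` are hypothetical names, and `atomic_exchange` from `<stdatomic.h>` stands in for OpenCL's `atomic_xchg`. It only models the lock values themselves and says nothing about whether ordinary global reads and writes on a GPU become visible to other work-groups.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Lock convention taken from the post: 1 = unlocked, 0 = locked. */

/* One acquisition attempt: swap in 0 and succeed only if the old value was 1. */
static bool tile_try_lock(atomic_int *lock) {
    assert(lock != NULL);
    return atomic_exchange(lock, 0) == 1;
}

/* Spin until the lock is acquired; mirrors
 * `while (!atomic_xchg(&(tile_locks[tile_id]), 0));` from the post. */
static void tile_lock(atomic_int *lock) {
    while (!tile_try_lock(lock)) {
        /* busy-wait */
    }
}

/* Release: write 1 back so another owner can acquire the tile. */
static void tile_unlock(atomic_int *lock) {
    assert(lock != NULL);
    atomic_exchange(lock, 1);
}
```

A second attempt on a held lock fails because the exchange returns 0, and it leaves the lock value at 0, so failed attempts never release the lock by accident.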

5 Replies
siu
Staff

One issue I could see is that your algorithm may require memory coherency between different workgroups.

Currently in the OpenCL 1.2 specification, there's no mechanism to ensure global memory coherency between workgroups while the kernel is executing.  The synchronization point is at the end of the kernel execution.  The upcoming OpenCL 2.0 standard will have the necessary APIs to support coherency.

So without the memory coherency, for example, data written to depth_buffer is not guaranteed to be seen by a different workgroup.  The barrier only ensures synchronization between workitems within the same group.

Hmm.. So OpenCL 1.2 global atomic functions only guarantee atomicity within a work-group? Wouldn't declaring the global buffers as volatile force the compiler to immediately commit any reads/writes where they happen?


Yes... global atomic functions guarantee atomicity only within a workgroup. There is no synchronisation between workgroups.

drallan
Challenger



The idea is to have the "main" work item acquire a lock for the tile, while all the others are waiting for it, apply the blit operation and then have the main item unlock it again. All using atomic operations.



However, this does not seem to work. Individual work groups are writing over each other as if no locking at all takes place. Are there any obvious errors I'm making here? Is global locking possible in OpenCL?




Scrambled output data can happen when data is stuck in a (non-global) cache.

Atomic operations are globally coherent within a single GPU and can be used for locking at that level, at least with GCN, 7970, etc.

Memory reads/writes can get stuck in the non-global L1 cache(s). Volatile is a generic compiler directive that tells the compiler a variable may change. I have not seen volatile affect the caching of ordinary memory operations.

You could try to implement the program using all atomic operations (as in your second sentence!) by replacing the memory reads/writes with, say, atomic_xchg(), atomic_or(), etc., and it may work. If the lock is working, all you need is to make sure that depths is globally updated, which the atomics will do. But make sure that depths starts out globally coherent (i.e., loaded from the host).

Memory writes will update global memory, but they may leave traces of the data in the local caches.

The atomics will remove that, according to my understanding of the GCN manual.

if (depths[l.y][l.x] < depth_buffer[fb_id]) {    // <--- here
    depth_buffer[fb_id] = depths[l.y][l.x];      // <--- here
    color_buffer[fb_id] = colors[l.y][l.x];      // <--- here ?
}

Thanks that did the trick.

I now use atomic_min(depth_buffer, depths[l.y][l.x]) to check and update the depth values. I had to convert the depth from float to int, but that's okay.
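
The post does not show how the float-to-int conversion was done, so the following is only one possible approach, sketched in plain C11 on the host rather than as kernel code. It assumes depths are finite, non-negative IEEE 754 floats, in which case their bit patterns compare in the same order as the values, so an integer minimum picks the nearest depth. `depth_to_key` and `atomic_min_i32` are hypothetical names; the latter is a compare-and-swap stand-in for OpenCL's built-in `atomic_min`, and in a kernel the bit reinterpretation would typically be written with `as_int()` instead of `memcpy`.

```c
#include <assert.h>
#include <math.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a non-negative float as a 32-bit int key.
 * Assumption: depth is finite and >= 0 (and not -0.0f), so that
 * key order matches float order. */
static int32_t depth_to_key(float depth) {
    assert(depth >= 0.0f && !signbit(depth));
    int32_t key;
    memcpy(&key, &depth, sizeof key);
    return key;
}

/* Host-side stand-in for atomic_min(): store min(*p, v) atomically
 * and return the value that was there before. */
static int32_t atomic_min_i32(_Atomic int32_t *p, int32_t v) {
    int32_t old = atomic_load(p);
    /* On CAS failure `old` is refreshed, so the comparison is re-evaluated. */
    while (v < old && !atomic_compare_exchange_weak(p, &old, v)) {
        /* retry */
    }
    return old;
}
```

Comparing the returned old value with the new key tells the caller whether its depth won, which is the same information the original `if (depths[l.y][l.x] < depth_buffer[fb_id])` test provided.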

I guess I should look more into avoiding locks like these in general.

Will OpenCL 2.0 provide some way to do global reads/stores that bypass the cache? I looked for the memory coherency APIs Siu talked about in the provisional OpenCL 2.0 spec, but I couldn't find anything on this.
