I've been working on a program and need to guarantee that reads from global memory actually happen. I can use a mem_fence to guarantee that writes are committed to physical memory (according to the IL that is generated).
It doesn't appear that volatile is giving me the desired effect as my code is still getting deadlocked.
There seems to be an IL read function field for UAV_RAW_LOAD (_cached), that "Specifies whether load is forced through the cache, which must be of type UAV_READ." What does "forced through" mean in this context?
Not knowing your code, I have to ask: what kind of deadlock are you getting? Since there is no guaranteed order in which blocks are scheduled, the block that is supposed to do the write may never be executed until the block that is supposed to do the read has finished.
The problem seems to be that the writes go through, but the reads return old values. The L1 cache isn't coherent. You can use atomics to get around this, but they are quite slow. It would be good if there were another way. I wish there were an equivalent of CUDA's __threadfence to get global memory consistency. Using separate kernels to achieve the same thing is proving to have very high overhead.
OpenCL has a mem_fence intrinsic (page 230 of the 1.1 spec). Just pass CLK_GLOBAL_MEM_FENCE as its flags argument (it is the only parameter).
"Orders loads and stores of a work-item executing a kernel. This means that loads and stores preceding the mem_fence will be committed to memory before any loads and stores following the mem_fence."
There is no guarantee of global visibility involved in that statement. The compiler does not currently generate globally coherent reads and writes in the presence of mem_fence.
There is absolutely no way to do this.
First of all, the OpenCL architecture currently doesn't provide global memory consistency across different work-groups within one kernel, only between two kernels executed one after the other. It's a "by design" feature.
There is another reason for this behavior: the GPU architecture. Different work-groups may be executed on different compute units, which know nothing about each other and have separate caches. So even if you flush the global memory cache with mem_fence, that doesn't mean work-groups on another compute unit will see the updated memory at the moment they read it. You don't know the order in which work-groups execute, and you cannot enforce cache coherency across the separate caches of different compute units.
I don't know what problem you are facing, but maybe this trick will help: call mem_fence in each work-group, and then do an atomic increment of a variable in global memory. The atomic increment returns the old value, so you can tell which work-group was the last one to write to global memory. At that point it is guaranteed that all work-items have written to memory and that memory is completely up to date. Unfortunately, only the last work-group will know it.
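
To make the counting logic of that trick concrete, here is a minimal host-side sketch in plain C11. It is not a kernel and not anyone's actual code from this thread. The names group_finish, sum_partials, partial, arrived and NUM_GROUPS are made up for illustration, and the "work-groups" are ordinary function calls. In a real kernel the increment would be atomic_inc on a __global counter, preceded by mem_fence(CLK_GLOBAL_MEM_FENCE). The sketch only shows how the old value identifies the last arriver; it says nothing about whether a given device's caches make the other groups' writes visible to that last group.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_GROUPS 4u   /* hypothetical number of work-groups */

/* Host-side stand-ins for buffers that would live in __global memory. */
static int partial[NUM_GROUPS];   /* one result slot per work-group */
static atomic_uint arrived = 0;   /* how many groups have finished  */

/*
 * What one "work-group" does when it finishes (hypothetical helper).
 * Returns true only for the group that arrives last.
 */
bool group_finish(unsigned group_id, int value)
{
    /* 1. Publish this group's result. */
    partial[group_id] = value;

    /* 2. A kernel would call mem_fence(CLK_GLOBAL_MEM_FENCE) here.
     *    On the host, the sequentially consistent fetch_add below
     *    already orders the store above before the increment. */

    /* 3. Take a ticket. fetch_add returns the value from before the
     *    add, which is also what atomic_inc returns in OpenCL C. */
    unsigned old = atomic_fetch_add(&arrived, 1u);

    /* The group that saw NUM_GROUPS - 1 is the last one to arrive. */
    return old == NUM_GROUPS - 1u;
}

/* Intended to be called only by the group for which
 * group_finish() returned true. */
int sum_partials(void)
{
    int sum = 0;
    for (unsigned i = 0; i < NUM_GROUPS; ++i)
        sum += partial[i];
    return sum;
}
```

If the four groups finish in the order 2, 0, 3, 1, only the call for group 1 returns true, and only that caller should go on to read everyone's results with sum_partials(). The arrival order does not matter; whichever group happens to come last gets the job.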