There are several AMD slides floating around talking about GDS (global data store) on 4850/4870 GPUs. I am not sure how to access it in the current 1.21 beta release.
Another question: the header files in the 1.21 beta mention another term, global GPR. How is that different from GDS?
BTW, I dumped the properties reported by the CAL package on a 4850. It said
GDS is not supported and global GPR is supported. What's wrong with my GPU?
Which one should be interpreted as the replacement for CUDA's atomic operations?
It looks like shared registers cannot be directly mapped onto CUDA's atomic operations, since they don't allow access across different thread indices. What about the future GDS support?
Originally posted by: MicahVillmow GDS is not currently supported in CAL even though the hardware does have this feature. This is mainly because there is no global locking mechanism to synchronize on, and therefore there is no good way of using it.
But we do not necessarily need locks to operate on shared memory. A memory fence to enforce consistency, such as we have with LDS, may suffice. Do you mean there is no global memory fence?
There is no way to synchronize across SIMDs on the HD4XXX series of chips, so it is not possible to enforce any type of memory consistency.
Originally posted by: MicahVillmow There is no way to synchronize across SIMDs on the HD4XXX series of chips, so it is not possible to enforce any type of memory consistency.
If the global data store permits reads and writes from threads running on different SIMDs, you can implement global barrier synchronization in software, can't you?
vvolkov,
The problem is that you need some sort of way to do a read-modify-write atomically to do software based synchronization.
There are some uses for GDS on HD4XXX, but they are not generic enough to allow for general usage and as such are not exposed.
Originally posted by: MicahVillmow
The problem is that you need some sort of way to do a read-modify-write atomically to do software based synchronization.
A barrier can be implemented without using atomic updates. Such barriers are used on multicore CPUs, and prototypes exist for NVIDIA GPUs. The idea is to avoid a race condition on updating a shared variable by replicating it across the thread array, so that each thread updates only its private copy. Suppose we have N threads. Then threads 2, ..., N increment variables 2, ..., N correspondingly and busy-wait until variable 1 is incremented. Thread 1 busy-waits until all variables 2, ..., N are incremented and then increments variable 1. This signals the other threads to proceed, which implements the barrier. No atomic updates are necessary since there are no race conditions. You may need to substitute "thread block" for "thread" to implement it on a GPU.
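The replicated-flag barrier described above can be sketched on the CPU with ordinary threads. This is only an analogue of the algorithm, not CAL or CUDA code, and all the names below are made up for the sketch; each thread owns one slot of `flags`, so every variable has exactly one writer and no atomic read-modify-write is needed.

```python
# CPU analogue of the replicated-flag software barrier: flags[i] is
# written only by thread i, so no atomic read-modify-write is needed.
# Plain Python threading; hypothetical names, not GPU code.
import threading
import time

N = 4        # number of "thread blocks"
ROUNDS = 20  # rounds of work separated by barriers

flags = [0] * N          # flags[i] is written only by thread i
data = [0] * N           # scratch written before each barrier
sums = [None] * ROUNDS   # thread 0 records a sum after each barrier

def barrier(tid, epoch):
    target = epoch + 1
    if tid != 0:
        flags[tid] = target              # announce arrival (single writer)
        while flags[0] < target:         # wait for thread 0's release
            time.sleep(0)                # yield while spinning
    else:
        for i in range(1, N):
            while flags[i] < target:     # wait for every other thread
                time.sleep(0)
        flags[0] = target                # release everyone

def worker(tid):
    for r in range(ROUNDS):
        data[tid] = r + tid              # write before the barrier...
        barrier(tid, 2 * r)
        if tid == 0:                     # ...must be visible after it
            sums[r] = sum(data)
        barrier(tid, 2 * r + 1)          # keep others from overwriting early

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# every round r should see (r) + (r+1) + (r+2) + (r+3) = 4r + 6
print(all(sums[r] == 4 * r + 6 for r in range(ROUNDS)))  # -> True
```

Note the two barriers per round: the first makes the writes visible to thread 0, the second stops the other threads from overwriting `data` before thread 0 has read it.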
vvolkov,
This way could possibly be done, but I'd posit that it is far too inefficient to be worthwhile. You can actually try this method now, using a global buffer as a replacement for GDS, to get some timing numbers, but I'm guessing it would take tens to hundreds of thousands of cycles for a single barrier, which would make it neither efficient nor useful. I'd like to be proved wrong though.
It probably would be quicker to do something similar to what mcuda does and break the kernel into multiple kernels at each global barrier point.
Originally posted by: MicahVillmow vvolkov,
This way could possibly be done, but I'd posit that it is far too inefficient to be worthwhile. You can actually try this method now, using a global buffer as a replacement for GDS, to get some timing numbers, but I'm guessing it would take tens to hundreds of thousands of cycles for a single barrier, which would make it neither efficient nor useful. I'd like to be proved wrong though.
It probably would be quicker to do something similar to what mcuda does and break the kernel into multiple kernels at each global barrier point.
I have performance numbers for NVIDIA GPUs using their "global memory". Running many such barriers back to back results in ~1-2 microseconds per barrier. This is close to 4 memory latencies, which sounds optimal for this algorithm.
1-2 microseconds is still less than the ~3-7 microseconds required to launch a new kernel in CUDA. Someone on the NVIDIA forum claimed getting speedups using this technique. There is a problem with memory consistency, though.
I don't have solid results on AMD GPUs yet, but it seems that launching a new kernel using calCtxRunProgram costs around 10 microseconds, which is ~10,000 shader clock cycles. So synchronizing via a global buffer may still be faster than breaking into multiple kernels.
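The measurement method behind these numbers, running many barriers back to back and dividing by the count, can be sketched the same way on the CPU. The absolute numbers below have nothing to do with GPU latencies; only the methodology matches, and the names are hypothetical.

```python
# Sketch of the back-to-back barrier timing methodology: time K
# barriers in a row and divide by K.  CPU-only analogue with Python
# threads; the resulting latency is not comparable to GPU numbers.
import threading
import time

N = 4            # participating threads
K = 200          # barriers timed back to back
flags = [0] * N  # flags[i] is written only by thread i

def barrier(tid, epoch):
    target = epoch + 1
    if tid != 0:
        flags[tid] = target
        while flags[0] < target:
            time.sleep(0)
    else:
        for i in range(1, N):
            while flags[i] < target:
                time.sleep(0)
        flags[0] = target

per_barrier_us = [0.0]  # filled in by thread 0

def worker(tid):
    t0 = time.perf_counter()
    for k in range(K):
        barrier(tid, k)
    if tid == 0:
        per_barrier_us[0] = (time.perf_counter() - t0) / K * 1e6

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{per_barrier_us[0]:.1f} us per barrier")
```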
Synchronizing via GDS might be even faster since it is on-chip, so it is likely to have lower latency than a global buffer.
Vasily
Can a wavefront finish before another one from the same kernel even starts?
If so, it won't be possible for the first thread to busy-wait on all the others, but it may be possible for each thread (say, the first of each wavefront) to busy-wait on the previous one. That wouldn't provide a perfect memory barrier, but assuming wavefronts are started in order and stores are executed in order too, it may allow ordered, serialized read/write access to each global variable.
This may avoid launching a new stream in some algorithms.
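The chained hand-off described above can also be sketched on the CPU: each "wavefront" busy-waits on the flag of its predecessor before touching the shared state, which serializes access in launch order without a full barrier. Again a hedged analogue with made-up names, not GPU code.

```python
# CPU sketch of the chained hand-off: thread i waits for thread i-1's
# flag, accesses the shared variable, then hands off to thread i+1.
# Plain Python threads; hypothetical names.
import threading
import time

N = 6
flags = [0] * (N + 1)
flags[0] = 1                 # "wavefront" 0 may start immediately
order = []                   # records who touched the shared state, in order

def worker(tid):
    while flags[tid] == 0:   # busy-wait on the previous wavefront's store
        time.sleep(0)
    order.append(tid)        # serialized read/write access
    flags[tid + 1] = 1       # hand off to the next wavefront

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(order)  # -> [0, 1, 2, 3, 4, 5]
```

No thread ever waits on a successor, so this works even if an early wavefront could retire before a later one starts, which is exactly the constraint raised in the post.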