Consider the following function:
void Swap(double *pA, double *pB)
{
    double t = *pA;   /* read *pA */
    *pA = *pB;        /* read *pB, write *pA */
    *pB = t;          /* write *pB */
}
This involves four global memory operations: two reads and two writes (plus, I think, two register operations). Couldn't the same be done with a single "exchange" operation on global memory?
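To make the count above concrete, here is a minimal sketch that routes every access in the function through counting helpers. The names load, store, g_loads, g_stores and SwapCounted are hypothetical and exist only for this illustration. It counts accesses as written in the source; what actually reaches memory depends on the compiler and the cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters, introduced only for this illustration. */
static size_t g_loads = 0;
static size_t g_stores = 0;

/* Read one double through a pointer and count the read. */
static double load(const double *p)
{
    assert(p != NULL);
    ++g_loads;
    return *p;
}

/* Write one double through a pointer and count the write. */
static void store(double *p, double v)
{
    assert(p != NULL);
    ++g_stores;
    *p = v;
}

/* Same steps as Swap, with each memory access made explicit. */
void SwapCounted(double *pA, double *pB)
{
    double t = load(pA);     /* read 1; t lives in a register */
    store(pA, load(pB));     /* read 2, write 1 */
    store(pB, t);            /* write 2 */
}
```

After a single call to SwapCounted, g_loads and g_stores are both 2, which matches the "2 reads and 2 writes" counted above.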
The point is that read and write operations transfer data between memory and a processing unit, passing through the cache and polluting it. In a swap, however, the processing unit (CPU or GPU) doesn't need the data itself: the swap could be performed internally in the memory chips.
Whatever the communication path between memory chips might be, it is shorter than the path to the processing unit. The processor (CPU or GPU) still gets involved, because if those memory addresses are cached, it has to swap their values in its cache as well. It just doesn't need to read the data from global memory and write it back: it only needs to send a command to global memory.
Because current computer programs are mostly memory-bound, and swap operations are not rare, such an addition to the instruction set architecture should improve performance a lot: the number of memory-CPU or memory-GPU transfers drops by a factor of 4 (2 reads and 2 writes vs. 1 exchange command).
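The 4-versus-1 comparison can be sketched with a toy model. Everything here is hypothetical: ToyMemory, mem_read, mem_write, mem_exchange and swap_conventional are made-up names, and mem_exchange stands in for the proposed command, which as far as I know no real memory interface offers. The model only counts messages crossing the processor-memory link and says nothing about latency or cache coherence.

```c
#include <assert.h>
#include <stddef.h>

#define MEM_WORDS 16

/* Toy model of a memory device (hypothetical, for illustration only). */
typedef struct {
    double words[MEM_WORDS];
    size_t transfers;   /* messages crossing the processor-memory link */
} ToyMemory;

/* One transfer: the data travels from memory to the processor. */
double mem_read(ToyMemory *m, size_t addr)
{
    assert(addr < MEM_WORDS);
    m->transfers++;
    return m->words[addr];
}

/* One transfer: the data travels from the processor to memory. */
void mem_write(ToyMemory *m, size_t addr, double v)
{
    assert(addr < MEM_WORDS);
    m->transfers++;
    m->words[addr] = v;
}

/* Hypothetical exchange command: one message carrying two addresses;
   the data itself never leaves the memory device. */
void mem_exchange(ToyMemory *m, size_t a, size_t b)
{
    assert(a < MEM_WORDS && b < MEM_WORDS);
    m->transfers++;
    double t = m->words[a];
    m->words[a] = m->words[b];
    m->words[b] = t;
}

/* Conventional swap: 2 reads + 2 writes = 4 transfers. */
void swap_conventional(ToyMemory *m, size_t a, size_t b)
{
    double t = mem_read(m, a);
    mem_write(m, a, mem_read(m, b));
    mem_write(m, b, t);
}
```

In this model swap_conventional adds 4 to transfers and mem_exchange adds 1, which is the ratio claimed above. Whether real hardware could reach that ratio is exactly what I am asking.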