I'm wondering what's the main difference between such types of memory.
There are some examples for CL that are supposed to clarify it a bit [but they don't, for me anyway] - for instance, matrix transpose and matrix multiply use local memory, while on the other hand the Sobel filter and matrix convolution work without any local memory.
I know that fetching data from local memory is probably faster than from global, but copying data from global to local also costs some time - doesn't it?
So far I assume that local memory might be more efficient when the kernel must deal with a lot of data, whereas global memory is more effective when the kernel operates on each element just once, even if it also needs access to the element's neighbourhood - am I right?
Thanks in advance for any response.
You are correct that you need to move data from global to local, and only then can you use it from local. If you are computing on a data element just once, it makes little sense to move it from global to local instead of simply computing on the global value.
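For that single-use case, a plain element-wise kernel can read straight from global memory; staging through local would only add a copy and a barrier for no benefit. A minimal sketch (the kernel and argument names here are made up for illustration):

```c
// Each work-item touches its element exactly once, so reading
// directly from global memory is the sensible choice.
__kernel void scale(__global const float *in,
                    __global float *out,
                    const float factor)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;   // one read, one write, no reuse
}
```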
The general reason to use local is that you have significant data reuse, generally read-modify-write reuse. Matrix multiply is an example where you will load data into local because you will reuse the same data elements multiple times. For something like a small window convolution, say 3x3, it may not make sense to move the data to local first because there may not be enough reuse. (If you are very careful about how you do your convolution, using local *may* actually be a win.)
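To make the reuse argument concrete, here is a sketch of the standard tiled matrix multiply pattern: each TILE x TILE block of A and B is fetched from global memory once per work-group and then read TILE times out of local memory, cutting global traffic by roughly a factor of TILE. This assumes square N x N matrices with N divisible by TILE and a TILE x TILE work-group; all names are illustrative, not from any particular SDK sample.

```c
#define TILE 16

__kernel void matmul_tiled(__global const float *A,
                           __global const float *B,
                           __global float *C,
                           const int N)
{
    // Scratch tiles shared by the whole work-group.
    __local float Atile[TILE][TILE];
    __local float Btile[TILE][TILE];

    int row = get_global_id(1);
    int col = get_global_id(0);
    int lr  = get_local_id(1);
    int lc  = get_local_id(0);

    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // One global read per element per tile...
        Atile[lr][lc] = A[row * N + (t * TILE + lc)];
        Btile[lr][lc] = B[(t * TILE + lr) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        // ...then each element is reused TILE times from local.
        for (int k = 0; k < TILE; ++k)
            acc += Atile[lr][k] * Btile[k][lc];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * N + col] = acc;
}
```

The barriers matter: every work-item must finish loading its tile before any work-item reads it, and must finish reading before the tile is overwritten on the next iteration.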
When I list the device properties for the RV770, the local memory type is reported as GLOBAL. This would suggest that for this device (and similar ones) there is no benefit to using local memory, as the access would have the same "cost" - with the possible exception of using the copy to reorganize the data for better access patterns. Is this right, or am I missing something?
Is this also true for kernel locals (what would be the stack in a "normal" program)?
Are they also allocated in the global memory?
Do any of the AMD GPUs have true local memory?