What happens when I do this, pls?
__kernel void main()
{
    int myArray[60];
}
Is myArray stored in local memory? Global memory? Private memory?
I use a 256x128 grid with 16x4 workgroups.
thx
Originally posted by: MicahVillmow Please reference the OpenCL spec for these types of questions; they are all answered in Section 6.5. The spec can be located here: http://www.khronos.org/opencl/
Private space. The question is... is that implemented as global memory, with a ~120-cycle penalty if I don't coalesce the accesses?
For example, in CUDA you can use the __local modifier to avoid the need for coalescing (that's not OpenCL's local memory, but a modifier to use private memory without having to coalesce the accesses). I want to know how this works for ATI cards and OpenCL, pls.
My kernel is very complex and I cannot use OpenCL's local memory, so I need the fastest available way to use a 60-int stack per thread.
Originally posted by: MicahVillmow Edit: Private memory goes to scratch, which exists in Global memory.
That's fine, but... do I need to perform any kind of coalescing to improve the speed, pls? Or does it work like CUDA's __local memory? Or perhaps your compiler takes care of that automatically?
But, as I indicated, I just need 60 ints, so will that fit in the "256 vec4" case (my kernel currently uses 39 GPRs)? What's the total private memory size per thread/work-group/compute unit?
I cannot use local memory, because 16x4 work-items x 60 x sizeof(int) will limit me to a maximum of 2 wavefronts per compute unit (vs. the optimal 24, so kernel speed will suffer)... or 2 work-groups vs. the optimal 8 per compute unit... resulting in 8% and 25% occupancy, respectively.
And if I'm forced to use global memory... how do I coalesce it correctly? I use 128x256 grids with 16x4 work-groups. Should I then allocate a 128x256x60xsizeof(int) global memory buffer? I was really looking for a feature like CUDA's __local, which provides global memory without having to coalesce the accesses manually.
You want to aim for 6 or more hardware threads per SIMD which is a limit of about 42 GPRs (256/6), as a minimum. This allows pairs of hardware threads to occupy the 3 primary execution units: sequencer (control flow, mem export), ALU and load/TEX simultaneously.
You need to use SKA to find out the actual GPR allocation of your code.
How many bits of each of these 60 ints are you using? e.g. if you're only using 16 bits from each, they'll pack to half size.
If you allocate another 15 GPRs for these ints then you'll be down to 4 hardware threads (256/54). This should be faster than putting this data in global memory.
Depending on scope, the impact might be less than an additional 15 GPRs.
Originally posted by: Jawed You want to aim for 6 or more hardware threads per SIMD which is a limit of about 42 GPRs (256/6), as a minimum. This allows pairs of hardware threads to occupy the 3 primary execution units: sequencer (control flow, mem export), ALU and load/TEX simultaneously.
Very nice advice, thx!
You need to use SKA to find out the actual GPR allocation of your code.
I'm lucky: 40.
How many bits of each of these 60 ints are you using? e.g. if you're only using 16 bits from each, they'll pack to half size.
I'm afraid I'm using all 32 bits.
If you allocate another 15 GPRs for these ints then you'll be down to 4 hardware threads (256/54). This should be faster than putting this data in global memory.
I'll try to force global memory and see if that helps, thx.
5750, with manual 32-bit global coalescing: 140 s; 64-bit: 240 s; 128-bit: 323 s
5750, with int var[60]: 120 s
NVIDIA GT240, with int var[60]: 48 s
I'm using a 256x256 grid with a 16x8 work size and 40 GPRs with the int v[60] (32 registers on NVIDIA, at 50% occupancy).