What happens when I do this, pls?
__kernel void main()
{
    int myArray[60];
}
Is myArray stored in local memory? Global memory? Private memory?
I use a 256x128 grid with 16x4 workgroups.
thx
Originally posted by: MicahVillmow Please reference the OpenCL spec for these types of questions; they are all answered in Section 6.5. The spec can be located here: http://www.khronos.org/opencl/
Private space. The question is... is that implemented as global memory, with a ~120-cycle penalty if I don't coalesce the accesses?
For example, in CUDA you can use the __local modifier to avoid the need for coalescing (that's not OpenCL's local memory, but a modifier to use private memory without having to coalesce the accesses). I want to know how this works for ATI cards and OpenCL, pls.
My kernel is very complex and I cannot use OpenCL's local memory, so I need the fastest available way to use a 60-int stack per thread.
Originally posted by: MicahVillmow Edit: Private memory goes to scratch, which exists in Global memory.
That's fine, but... do I need to perform any kind of coalescing to improve the speed, pls? Or does it work like CUDA's __local memory? Or perhaps your compiler takes care of that automatically?
But, as I indicated, I just need 60 ints, so will that fit in the "256 vec4" case (my kernel currently uses 39 GPRs)? What's the total private memory size per thread/work-group/compute unit?
I cannot use local memory, because 16x4 work-items x 60 x sizeof(int) will limit me to a maximum of 2 wavefronts per compute unit (vs. the optimal 24, so kernel speed will suffer)... or 2 work-groups vs. the optimal 8 per compute unit... resulting in 8% and 25% occupancy, respectively.
And if I'm forced to use global memory... how do I coalesce it correctly? I use 128x256 grids with 16x4 work-groups. Should I then allocate a 128x256x60xsizeof(int) global memory buffer? I was really looking for a feature like CUDA's __local, which provides global memory without having to coalesce the accesses manually.
You want to aim for 6 or more hardware threads per SIMD which is a limit of about 42 GPRs (256/6), as a minimum. This allows pairs of hardware threads to occupy the 3 primary execution units: sequencer (control flow, mem export), ALU and load/TEX simultaneously.
You need to use SKA to find out the actual GPR allocation of your code.
How many bits of each of these 60 ints are you using? e.g. if you're only using 16 bits from each, they'll pack to half size.
If you allocate another 15 GPRs for these ints then you'll be down to 4 hardware threads (256/54). This should be faster than putting this data in global memory.
Depending on scope, the impact might be less than an additional 15 GPRs.
Originally posted by: Jawed You want to aim for 6 or more hardware threads per SIMD which is a limit of about 42 GPRs (256/6), as a minimum. This allows pairs of hardware threads to occupy the 3 primary execution units: sequencer (control flow, mem export), ALU and load/TEX simultaneously.
Very nice advice, thx!
You need to use SKA to find out the actual GPR allocation of your code.
I'm lucky: 40.
How many bits of each of these 60 ints are you using? e.g. if you're only using 16 bits from each, they'll pack to half size.
I'm afraid I'm using all 32 bits.
If you allocate another 15 GPRs for these ints then you'll be down to 4 hardware threads (256/54). This should be faster than putting this data in global memory.
I'll try to force global memory and see if that helps, thx.
5750, with manual 32-bit global coalescing: 140 s; 64-bit: 240 s; 128-bit: 323 s
5750, with int var[60]: 120 s
NVIDIA GT240, with int var[60]: 48 s
I'm using a 256x256 grid with a 16x8 work size and 40 GPRs with the int v[60] (32 registers on NVIDIA, at 50% occupancy).