
Raistmer
Adept II

Private array spilled in global memory?

In my kernel I use:
float4 d[128];

Assembly for RV770 gives:
01 MEM_SCRATCH_WRITE: VEC_PTR[0], R0, ARRAY_SIZE(128) ELEM_SIZE(3)
....
128 MEM_SCRATCH_WRITE: VEC_PTR[127], R0, ARRAY_SIZE(128) ELEM_SIZE(3)
129 MEM_SCRATCH_WRITE_ACK: VEC_PTR[128], R0, ARRAY_SIZE(128) ELEM_SIZE(3)

What does this mean? Has the whole array been spilled into global memory?
I'm trying to cache data in registers to avoid global memory accesses, so spilling registers to memory is not an option at all...
8 Replies

Raistmer,
The default setting of OpenCL is to run 4 wavefronts per group. This can be overridden with the kernel attribute __attribute__((reqd_work_group_size(X, Y, Z))). So by default, if you use more than ~60 registers, you will spill to memory.
Also, private arrays will be placed in registers only if the indexing pattern is simple or the size is small. Otherwise we use the hardware scratch mechanism, which is stored in global memory.

Use arrays only as a last resort, they are not fast.
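
The "~60 registers" figure above can be reproduced with a rough back-of-the-envelope calculation. The plain C sketch below assumes 16384 128-bit GPRs per SIMD and 64 work-items per wavefront; neither figure is stated in this thread. It also assumes a whole work-group has to be resident on one SIMD, and the function names are hypothetical.

```c
#include <assert.h>

/* Assumed hardware figures, NOT taken from this thread. */
enum {
    GPRS_PER_SIMD  = 16384, /* assumed number of 128-bit GPRs per SIMD */
    WAVEFRONT_SIZE = 64     /* assumed work-items per wavefront */
};

/* Wavefronts needed to hold one work-group, rounded up. */
int wavefronts_per_group(int work_group_size)
{
    assert(work_group_size > 0);
    return (work_group_size + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

/* Rough upper bound on 128-bit GPRs available to each work-item,
 * assuming every wavefront of the group shares one SIMD's register file. */
int gpr_budget_per_work_item(int work_group_size)
{
    int resident_work_items =
        wavefronts_per_group(work_group_size) * WAVEFRONT_SIZE;
    return GPRS_PER_SIMD / resident_work_items;
}
```

Under these assumptions gpr_budget_per_work_item(256) returns 64 for the default group of 4 wavefronts, close to the "~60" quoted above, and a group of 64 work-items gets a budget of 256. This is only an upper bound; the compiler's own heuristics still decide whether a private array ends up in registers or in scratch.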

The indexing pattern is very simple: for (int i = 0; i < 128; i++) d[i] = something.
The size is rather big: 128.
So I can't use array notation to address 128 float4 registers without spilling them into memory?
If I use 128 separate variable names instead, the kernel will be just huge and very ugly.
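
One way to get separately named variables without writing each one by hand is to let the preprocessor generate them. The sketch below is plain C with 8 elements instead of 128 and float instead of float4, so that it compiles on a host; all macro and function names are hypothetical. Nothing in this thread confirms that the CAL compiler keeps such variables in GPRs, so treat it as an experiment to try, not a guaranteed fix.

```c
#include <assert.h>

/* X-macro listing the element indices. A real kernel would list 0..127. */
#define D_INDICES(X) X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7)

#define DECLARE_D(i) float d##i;          /* one named scalar per element   */
#define LOAD_D(i)    d##i = src[i];       /* fully unrolled load            */
#define ACCUM_D(i)   sum += d##i * d##i;  /* fully unrolled use of the data */

/* Host-side stand-in for the kernel body: caches 8 values in named
 * variables, then reads them back. In OpenCL C the element type would be
 * float4 and src would be a __global pointer. */
float sum_of_squares8(const float *src)
{
    float sum = 0.0f;
    D_INDICES(DECLARE_D)
    assert(src != 0); /* host-side sanity check only */
    D_INDICES(LOAD_D)
    D_INDICES(ACCUM_D)
    return sum;
}
```

The source stays short because only the index list grows with the element count, while the generated code contains no array and no runtime indexing.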

Raistmer,
If you limit the work group size via the above attribute to 64 work-items, this might not spill into memory.

With __attribute__((reqd_work_group_size(32, 1, 1))), SKA shows the same number of scratch registers as without the attribute.
It shows 129 scratch registers and 30 GPRs.

Why did the numbers stay the same? Does SKA not understand this attribute?

Consider placing your array in local memory, if it is accessed often.


AFAIK there is no true local memory on 4xxx GPUs. It will be emulated via global memory accesses, which is exactly what I am trying to avoid: accesses to global memory, whether via spilled registers or via emulated local memory, will be much slower than true on-chip register access.

Raistmer,
The problem is just that the array is too large, so it fails a heuristic check that our CAL compiler uses to decide whether it should attempt to use registers.

Originally posted by: MicahVillmow

Raistmer,

The problem is just that the array is too large, so it fails a heuristic check that our CAL compiler uses to decide whether it should attempt to use registers.


So I should use 128 different variable names to take advantage of the many registers available per work-item. A pity, indeed.
Maybe some compiler switches could override the default compiler behavior when needed (look at NV's compiler: it has an option to limit/set the number of registers per thread)?
In general, a compiler can't be clever enough to cover all possible cases, right? Allowing some manual control over its decisions could be very useful.
Surely my case is not very well suited to a GPU, but it could be handled much better with the hardware already available.