
Raistmer
Adept II

Some performance-related questions about current OpenCL implementation

1) Is it possible to have a local (per-thread) array stored in registers in OpenCL?
It was not possible in Brook+.
That is, if I write inside a kernel:
float4 buf[32];
will these 32 elements be placed in registers, or will they be spilled into global memory?

2) If a write to a global memory buffer sits inside the branch that is not taken (decided on a per-wavefront basis), will this write be avoided, or will zeros or junk be written anyway? (See the fragment after question 3 for what I mean.)
Also, is such a write avoided per wavefront or per thread?

3) How many registers can be used per thread while still hiding global memory read latency more or less effectively? (Or: how many wavefronts per SIMD should be launched simultaneously to hide the read latency?)
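
To illustrate question 2), I mean a fragment like this (just an illustration, names made up):

__kernel void example(__global float4* out, float4 result, int flag)
{
    if (flag)                                // same value for the whole wavefront
        out[get_global_id(0)] = result;      // skipped entirely, or zeros/junk written anyway?
}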
0 Likes
21 Replies
nou
Exemplar

1. Private arrays are in global memory; use local arrays instead. There was also a note that the devs are working on moving arrays into registers.
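
Something along these lines (rough sketch; the sizes and names are arbitrary):

__kernel void example(__global const float4* in, __global float4* out)
{
    // note: __local memory is shared by the whole work-group, so each
    // work-item uses its own slice rather than a truly private array
    __local float4 buf[64 * 4];              // 4 float4 elements per work-item, group size 64
    int lid = get_local_id(0);
    for (int i = 0; i < 4; ++i)
        buf[lid * 4 + i] = in[get_global_id(0) * 4 + i];
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = buf[lid * 4];
}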

0 Likes

Originally posted by: nou

1. Private arrays are in global memory; use local arrays instead. There was also a note that the devs are working on moving arrays into registers.



Unfortunately that won't work for the 4xxx series I'm currently targeting. These GPUs have no accessible local memory; it's emulated via global memory.

It's a pity the register file can't be used for this either.
0 Likes

Raistmer,
Our next release will move some private arrays into hardware-indexable temp registers. Depending on the size and usage, the compiler will determine whether the accesses are converted to registers or backed by global memory.
0 Likes
Raistmer
Adept II

Thanks.
Could you give some insight into questions 2) and 3), please?
0 Likes

Raistmer,
2) If the branch is not taken, the write is not executed.
3) The register count affects the number of wavefronts that can execute on the same SIMD. The number of wavefronts that are required to hide the memory latency is algorithm/kernel dependent.
0 Likes

Originally posted by: MicahVillmow

3) The register count affects the number of wavefronts that can execute on the same SIMD. The number of wavefronts that are required to hide the memory latency is algorithm/kernel dependent.


I'm considering the following situation:
the kernel has a memory load operation only at the very beginning; after that, all computations are carried out in registers.
That is, to start the computations some values have to be loaded from memory, and each new wavefront will issue its memory loads right after creation, no matter how long the whole kernel is. Computations can start only after the data is in registers, hence some latency from the memory read. As I understand it, in this situation that latency can't be hidden by computations inside the kernel itself (at least until the very first wavefront has its data loaded); the only option is to keep the GPU busy by issuing new memory reads (and issuing those probably takes some cycles too). How many wavefronts should be launched in this situation before the first one receives its data?
Or is it not possible to hide this latency at all, so that the first few wavefronts will always suffer from the delay?
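
To make it concrete, the kernel shape I have in mind is roughly this (sketch only):

__kernel void compute(__global const float4* in, __global float4* out)
{
    int gid = get_global_id(0);
    float4 v = in[gid];                  // the only global read, right at the start
    float4 acc = (float4)(0.0f);
    for (int i = 0; i < 1024; ++i)       // long computation carried entirely in registers
        acc = acc * v + v;
    out[gid] = acc;                      // single write at the end
}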
0 Likes

The first wavefront will always suffer from this latency. The concept of hiding memory latency by issuing more wavefronts only becomes meaningful after the first wavefront starts executing. The initial setup time and the memory latency of the first wavefront will always be there.



0 Likes

Originally posted by: omkaranathan

The first wavefront will always suffer from this latency. The concept of hiding memory latency by issuing more wavefronts only becomes meaningful after the first wavefront starts executing. The initial setup time and the memory latency of the first wavefront will always be there.

Hasn't Fermi introduced the capability to schedule multiple kernels precisely to solve this problem?

0 Likes

No matter how you play it, hiding latency by pipelining execution (i.e. switching to another wavefront/warp/kernel/context/whatever to do some meaningful work during the wait) will not shorten the initial wait. The requests for memory will not come back to the first requester any sooner, Fermi or no Fermi. It's true, though, that the kernel concurrency they introduced is yet another step in pipelining.

As for moving arrays to registers: I presume the indexing would have to be done entirely with literals in code, like array[2]. The moment you start indexing with a runtime variable you force the compiler to assume there's arbitrary pointer arithmetic involved, and I don't see how you could reliably implement pointer arithmetic over registers. What is register A plus 'k'? How can you take the address of a register?
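
In other words (hypothetical fragment):

float4 arr[4];
arr[0] = a;  arr[1] = b;  arr[2] = c;  arr[3] = d;    // literal indices: each element can simply be renamed to a fixed register
float4 x = arr[2];                                     // still just a register access

int k = get_local_id(0) & 3;
float4 y = arr[k];                                     // runtime index: literal renaming no longer works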

0 Likes

Originally posted by: _Big_Mac_ No matter how you play it, hiding latency by pipelining execution (i.e. switching to another wavefront/warp/kernel/context/whatever to do some meaningful work during the wait) will not shorten the initial wait. The requests for memory will not come back to the first requester any sooner, Fermi or no Fermi

What if you use different kernels? In one case you pay the "initial wait" every time you switch to executing a new kind of kernel, while in the other case there is no penalty. It looks like it could make a huge difference in any application chaining different kernels (i.e. a quite common case).

0 Likes

Multiple kernels still won't really help with the initial wait.

1st kernel, 1st warp: you still have to wait for the fetch units (which are busy) to finish.

2nd kernel, 1st warp: you have to wait for fetches, but the fetch units are busy with the 1st kernel's 1st warp...

...so unless they have dedicated fetch units per kernel, which seems stupid.

Multiple kernels mainly increase performance through better ALU utilization between two or more fetch-bound kernels.

0 Likes

_Big_Mac_,
Our hardware can index into registers via a special addressing mode; however, this will only happen for arrays that are fairly small and dynamically indexed (i.e. around 10 elements or less). Once the array gets above a certain size and requires too many registers, it gets pushed into memory.
0 Likes

Originally posted by: MicahVillmow

_Big_Mac_,

Our hardware can index into registers via a special addressing mode; however, this will only happen for arrays that are fairly small and dynamically indexed (i.e. around 10 elements or less). Once the array gets above a certain size and requires too many registers, it gets pushed into memory.


That's very bad behavior when the big register file could be used as a "cache" to avoid unneeded fetches from global memory.
16K float4 registers per unit and only 10 of them can be used as an array? Not good.
0 Likes

Raistmer,
Each SIMD has 256x64 128-bit registers. That means each thread in a wavefront has access to at most 256 registers. Some registers are reserved for temps, giving about 240 usable registers per thread column. These must be divided evenly between the wavefronts in a group. In OpenCL the default group size is 256 threads, or 4 wavefronts on the high-end chips. That leaves 60 registers per thread, so in this case 1/6 of the registers are used for this array. In practice 4 wavefronts are not enough to hide all the latency on the chip, so the compiler will attempt to place multiple groups on a single SIMD. If two groups are placed, then each thread gets 30 registers and 1/3 would go to indexing into an array. If three groups are scheduled on a SIMD, then half the available registers would be allocated to indexing into an array.

So, as you can see, 10 vector elements can take up a fairly large amount of space in the register file.
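
Spelling that arithmetic out (same numbers as above):

// register file per SIMD: 256 (deep) x 64 (wide) 128-bit registers
// usable per thread column after temps: ~240
//
// 1 group of 256 threads = 4 wavefronts : 240 / 4  = 60 regs per thread -> float4 arr[10] takes 10/60 = 1/6
// 2 groups = 8 wavefronts               : 240 / 8  = 30 regs per thread -> 10/30 = 1/3
// 3 groups = 12 wavefronts              : 240 / 12 = 20 regs per thread -> 10/20 = 1/2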
0 Likes

Well, some tasks require more registers at the expense of the number of simultaneously running threads. This is even reflected in the OpenCL spec when it talks about task-based parallelism.
For example, if I need only a few threads, say 2 threads per SIMD (my GPU has 10 SIMDs, so it would be a kernel of 20 threads), is it possible to use 32*256 registers in one thread (per SIMD) and 32*256 registers in another?
Or, no matter how many threads are available, will only 256 registers be allocated per wavefront?
You gave the answer from the point of view of a fully used wavefront; I need as many free registers as possible per thread (work-item), even if there is only one thread per wavefront.

EDIT:
In short, what will happen if I write something like this in a kernel:

float4 a0,a1,a2,a3,......a512;

Will such a kernel fail? Will the compiler place most of the variables into global memory instead of registers? Is it possible to instruct the compiler to use more registers per thread than its default value (there is a corresponding compiler switch in CUDA; I don't know about OpenCL)?
0 Likes

Originally posted by: MicahVillmow
If three groups are scheduled on a SIMD, then half the available registers would be allocated to indexing into an array.


To indexing?? That is, 10 registers are needed as overhead just to index into a single array, if it were implemented?? Why are so many registers used as an index???

Maybe there's some misunderstanding here:
float4 array[10];
will actually eat 20 registers, not 10?
0 Likes

Each thread is limited to 256 registers. Our register file is 64 wide, 256 deep and threads cannot access registers outside of their column.
0 Likes

Originally posted by: MicahVillmow

Each thread is limited to 256 registers. Our register file is 64 wide, 256 deep and threads cannot access registers outside of their column.


I see, thanks. That's very important info for the optimization I'm trying to do. Knowing this limit will help a lot.
0 Likes

Raistmer,
No, if you use more than 10 or so elements in your array (not 100% sure about the cutoff limit), then you no longer get dynamic indexing into the register file; instead the stack will be pushed out to memory.
0 Likes

I see, thanks.
Currently I emulate an array of 64 floats via 16 float4 variables, like
float4 o0,o1,o2,...,o15;
From your info I conclude that if I wrote
float4 o[16];
instead, I would end up with this buffer placed in global memory, that is, no benefit to using an array.
Can I hope that the separate float4 variables o0 to o15 will be kept in registers? That's still far from the 256-registers-per-thread limit...
0 Likes

Raistmer,
If you use an array and you don't dynamically index into it, the elements should be placed in registers, since the array itself is not required. It is dynamic indexing into the array that causes the performance issues.
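
For example (rough sketch; 'data' is just a placeholder):

float4 o[16];
o[0] = data;                          // literal indices only...
o[1] = o[0] * o[0];
o[2] = o[1] + o[0];                   // ...so each element can live in its own register

int k = get_local_id(0) & 15;
float4 y = o[k];                      // runtime index on a 16-element array: above the ~10-element cutoff, so it may be pushed out to memory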
0 Likes