Raistmer
Adept II

Some performance-related questions about current OpenCL implementation

1) Is it possible in OpenCL to have a per-thread (private) array stored in registers?
It was not possible in Brook+.
That is, if I write inside a kernel:
float4 buf[32];
will these 32 elements be placed in registers, or will they be spilled to global memory?

2) If a write to a global memory buffer sits in a branch that is not taken (on a per-wavefront basis), will the write be skipped, or will zeros or junk be written anyway?
Also, is such a write skipped on a per-wavefront or a per-thread basis?

3) How many registers can a thread use while still hiding global memory read latency reasonably well? (Or: how many wavefronts per SIMD must be in flight simultaneously to hide read latency?)
nou
Exemplar

1. Private arrays are in global memory. Use local arrays instead. There was also a note that the devs are working on moving arrays into registers too.

Raistmer
Adept II

Originally posted by: nou

1. Private arrays are in global memory. Use local arrays instead. There was also a note that the devs are working on moving arrays into registers too.



Unfortunately that will not work for the 4xxx series that I currently target. These GPUs have no accessible local memory; it is emulated via global memory.

It's a pity that the register file can't be used either.
MicahVillmow
Staff

Raistmer,
Our next release will move some private arrays into hardware-indexable temp registers. Depending on the size and usage, the compiler will determine whether the accesses are converted to registers or backed by global memory.
Raistmer
Adept II

Thanks.
Could you give some insight into questions 2) and 3), please?
MicahVillmow
Staff

Raistmer,
2) If the branch is not taken, the write is not executed.
3) The register count affects the number of wavefronts that can execute on the same SIMD. The number of wavefronts that are required to hide the memory latency is algorithm/kernel dependent.
Raistmer
Adept II

Originally posted by: MicahVillmow

3) The register count affects the number of wavefronts that can execute on the same SIMD. The number of wavefronts that are required to hide the memory latency is algorithm/kernel dependent.


I am considering this situation:
The kernel has memory load operations only at its very beginning. After that, all computations are carried out in registers.
That is, some values must be loaded from memory before computation can start, and each new wavefront issues its memory loads right after creation, no matter how long the whole kernel is. Computation can start only once the data is in registers, hence some latency from the memory read. As I understand it, in this situation the latency can't be hidden by computations inside the kernel itself (at least until the very first wavefront has its data loaded); the only possible way is to keep the GPU busy by issuing new memory reads (issuing them probably takes some cycles too). How many wavefronts should be launched in this situation before the first one receives its data?
Or is it not possible to hide this latency at all, so that the first few wavefronts will always suffer from the delay?
omkaranathan
Journeyman III

The first wavefront will always suffer from this latency. The concept of hiding the memory latency by issuing more wavefronts is meaningful after the first wavefront starts executing. The initial setup time and the memory latency of first wavefront will always be there.



davibu
Journeyman III

Originally posted by: omkaranathan

The first wavefront will always suffer from this latency. The concept of hiding the memory latency by issuing more wavefronts is meaningful after the first wavefront starts executing. The initial setup time and the memory latency of first wavefront will always be there.

 

Hasn't Fermi introduced the ability to schedule multiple kernels concurrently precisely to solve this problem?

 

_Big_Mac_
Journeyman III

No matter how you play it, hiding latency by pipelining execution (i.e. switching to another wavefront/warp/kernel/context/whatever to do some meaningful work during the wait) will not shorten the initial wait. The memory requests will not come back to the first requester any sooner, Fermi or no Fermi. It's true, though, that the kernel concurrency they introduced is yet another step in pipelining.

As for moving arrays to registers: I presume indexing would have to be done entirely with literals in code, like array[2]. The moment you start using array[i] you're forcing the compiler to assume there's arbitrary pointer arithmetic involved, and I don't see how you could reliably implement pointer arithmetic over registers. What is register A plus 'k'? How can you take the address of a register?
