Originally posted by: nou
1. Private arrays are in global memory; use local arrays instead. And there was a note that the devs are working on moving arrays into registers too.
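A minimal sketch of the two variants (kernel names and sizes are illustrative, and whether the __private array actually lands in scratch memory depends on the compiler):

__kernel void table_private(__global float *out)
{
    /* __private array: may be placed in slow global "scratch" memory
       instead of registers. */
    float mine[16];
    for (int i = 0; i < 16; ++i)
        mine[i] = (float)i * 2.0f;
    out[get_global_id(0)] = mine[5];
}

__kernel void table_local(__global float *out)
{
    /* The same per-work-item table carved out of on-chip __local memory
       (LDS). Local memory is shared by the whole work-group, so each
       work-item takes its own slice; assumes a work-group size <= 64. */
    __local float slices[64 * 16];
    __local float *mine = slices + get_local_id(0) * 16;
    for (int i = 0; i < 16; ++i)
        mine[i] = (float)i * 2.0f;
    out[get_global_id(0)] = mine[5];
}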
Originally posted by: MicahVillmow
3) The register count affects the number of wavefronts that can execute on the same SIMD. The number of wavefronts that are required to hide the memory latency is algorithm/kernel dependent.
The first wavefront will always suffer from this latency. The concept of hiding the memory latency by issuing more wavefronts is meaningful only after the first wavefront starts executing. The initial setup time and the memory latency of the first wavefront will always be there.
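A rough back-of-envelope illustration of that point (the numbers are assumed, not from this thread): if a global memory fetch costs on the order of 400 cycles and each wavefront does about 40 cycles of ALU work between fetches, then roughly 400 / 40 = 10 resident wavefronts are needed per SIMD to keep the ALUs busy while fetches are in flight. The first wavefront's 400-cycle wait is paid in full either way; the extra wavefronts only fill the gap behind it.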
Originally posted by: omkaranathan
The first wavefront will always suffer from this latency. The concept of hiding the memory latency by issuing more wavefronts is meaningful only after the first wavefront starts executing. The initial setup time and the memory latency of the first wavefront will always be there.
Hasn't Fermi introduced the capability to schedule multiple kernels precisely to solve this problem?
No matter how you play it, hiding latency by pipelining execution (i.e. switching to another wavefront/warp/kernel/context/whatever to do some meaningful work during the wait) will not shorten the initial wait. The requests for memory will not come back to the first requester any sooner, Fermi or no Fermi. It's true, though, that the kernel concurrency they introduced is yet another step in pipelining.
As for moving arrays to registers - I presume indexing would have to be done entirely with literals in code, like array[2]. The moment you start using array[i] with a variable index, I'd expect it to get pushed out to memory.
Originally posted by: _Big_Mac_ No matter how you play it, hiding latency by pipelining execution (i.e. switching to another wavefront/warp/kernel/context/whatever to do some meaningful work during the wait) will not shorten the initial wait. The requests for memory will not come back to the first requester any sooner, Fermi or no Fermi
What about if you use different kernels? Without concurrent kernels you pay the "initial wait" every time you switch to executing a new kind of kernel, while with them there is no such penalty. It looks like it can make a huge difference in any application chain running different kernels (i.e. a quite common case).
Multiple kernels still won't really help with the initial wait.
1st kernel, 1st warp: you still have to wait for the fetch units (which are busy) to finish.
2nd kernel, 1st warp: you have to wait for fetches, but the fetch units are busy with the 1st kernel's 1st warp...
...so it won't help, unless they have dedicated fetch units per kernel, which seems stupid.
Multiple kernels mainly increase performance by improving ALU utilization across two or more fetch-bound kernels.
Originally posted by: MicahVillmow
_Big_Mac_,
Our hardware can index into registers via a special addressing mode; however, this will only occur with arrays that are fairly small and dynamically indexed (i.e. around 10 elements or fewer). Once the array gets above a certain size and requires too many registers, it gets pushed into memory.
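A short kernel sketch of the three cases (the sizes and the ~10-element threshold are illustrative):

__kernel void index_modes(__global float *out, int i)
{
    /* Literal indices only: can live entirely in registers. */
    float a[4] = {0.0f, 1.0f, 2.0f, 3.0f};
    float x = a[2];

    /* Small array, dynamic index: a candidate for the register-indexed
       addressing mode described above. */
    float b[8];
    for (int k = 0; k < 8; ++k) b[k] = (float)k;
    float y = b[i & 7];

    /* Large array, dynamic index: requires too many registers, so it
       gets pushed into memory. */
    float c[64];
    for (int k = 0; k < 64; ++k) c[k] = (float)k;
    float z = c[i & 63];

    out[get_global_id(0)] = x + y + z;
}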
Originally posted by: MicahVillmow
If three groups are scheduled on a SIMD, then half the available registers would be allocated to indexing into an array.
Originally posted by: MicahVillmow
Each thread is limited to 256 registers. Our register file is 64 wide and 256 deep, and threads cannot access registers outside of their column.
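Reading those two numbers together (this arithmetic is an inference, not from the thread): with a 256-deep column per thread, a kernel using N registers per thread leaves room for at most floor(256 / N) wavefronts' worth of register state in each column. For example, 32 registers per thread allows up to 256 / 32 = 8 co-resident wavefronts (before other limits apply), while 128 registers per thread allows only 2, which may be too few to hide memory latency - the register/wavefront trade-off mentioned earlier in the thread.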