How is the kernel binary actually passed to the GPU (at context creation, just before kernel launch, or at some other point)?
Where is it stored? (A CPU keeps code in ordinary system memory but uses a separate L1 instruction cache to speed up instruction fetching. What about a GPU? Is the kernel binary stored in GPU global memory, or in some special limited-size buffer? How are instructions fetched? Do they go through the common data cache, through the constant cache, or through something dedicated to instructions?)
How big can the performance impact of code bloat be? (That is, if one uses several different specialized kernels to do similar work instead of a single slower, more complex, but universal one, how will the increase in the total number of kernels and in the total kernel binary size affect performance?)
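To make the specialized-vs-universal trade-off concrete, here is a hypothetical CUDA sketch (kernel names and the operations themselves are made up for illustration, not taken from real code):

```cuda
// Option A: one "universal" kernel that selects behavior at run time.
// Smaller total code footprint, but every thread evaluates the branch.
__global__ void transform_universal(float *data, int n, int mode)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (mode == 0)
        data[i] = data[i] * 2.0f;   // scale variant
    else
        data[i] = data[i] + 1.0f;   // offset variant
}

// Option B: several specialized, branch-free kernels.
// Each is simpler, but together they produce a larger total binary.
__global__ void transform_scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

__global__ void transform_offset(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] + 1.0f;
}
```

The question is whether the larger combined code size of Option B (many such specialized kernels) can hurt instruction fetching or caching enough to outweigh the savings from removing the runtime branches.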
There is plenty of information about how data memory is handled on GPUs, but little if any about how code is handled. Is this worth discussing?