I can only speak to how this works on NVIDIA cards. Their compiler inlines device functions by default, even though the GPUs can jump and call; you can force no-inline if you want. It's possible that the compiler has a heuristic not to inline really huge functions so that you don't blow the I-cache, but I haven't checked.
Code is cached in the L1 instruction cache. Its size is not documented, but it can easily be found by microbenchmarking with loop unrolling (IIRC 8KB, but I'm not sure anymore). When your code is bigger than that, speed drops gradually, due to, I suppose, additional global memory fetches. I don't know whether the code physically resides in global memory or somewhere else, but I suspect gmem, or rather some invisible partition of it. I'm not sure whether the code stays on the GPU between kernel launches. I suspect it does, but I can't back that up in any way.
I tested ATI GPUs some time ago and found that the code cache size is only 48K. With an "unroll everything" strategy it's too easy to exceed this value, especially on a VLIW architecture. And on RV770, a kernel size > 48K is a performance killer; you can check out my original post.
I'm simply parsing the resulting ISA for the "CodeLen=%d" string.
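For anyone who wants to automate that check, here is a minimal sketch of the parsing step in C++. It assumes the ISA dump contains the marker exactly as "CodeLen=" followed by an integer, and that the value is in bytes; the function names `parse_code_len` and `exceeds_code_cache` are made up for illustration, and the 48K limit is just the RV770 figure mentioned above.

```cpp
#include <cstdio>
#include <cstring>
#include <string>

// Code cache size reported above for RV770 (assumed here to be 48 * 1024 bytes).
constexpr int kCodeCacheBytes = 48 * 1024;

// Returns the integer following the first "CodeLen=" marker in the ISA text,
// or -1 if the marker is missing or not followed by a number.
// Assumption: the marker has exactly this spelling, with no spaces around '='.
int parse_code_len(const std::string& isa_text) {
    const char* marker = std::strstr(isa_text.c_str(), "CodeLen=");
    if (marker == nullptr) {
        return -1;
    }
    int code_len = 0;
    if (std::sscanf(marker, "CodeLen=%d", &code_len) != 1) {
        return -1;
    }
    return code_len;
}

// True when the kernel is larger than the assumed code cache size.
bool exceeds_code_cache(int code_len_bytes) {
    return code_len_bytes > kCodeCacheBytes;
}
```

Read the ISA file into a string, pass it to `parse_code_len`, and feed a non-negative result to `exceeds_code_cache` to flag kernels that are likely to hit the slowdown. If your ISA dump formats the field differently, adjust the marker string accordingly.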
Originally posted by: Raistmer
How do I obtain the ISA file for a Brook kernel?
Use SKA to get ISA from Brook+ kernel.
Just copy and paste your code into the source code window, select the Brook+ compiler and your GPU, and you'll get the ISA code in the right-hand window.