Archives Discussions

Raistmer · ‎04-22-2010

Will all those calls be inlined?

If I have many different kernels with similar parts, is it worth separating those parts into a function? Will the GPU use something like a call instruction, or will the compiler just inline the function back into every kernel that calls it?

Can some code size be saved by using separate functions inside kernels, or is it just for programmer convenience?

And another question: where are kernels stored at runtime? Will a kernel be stored in GPU global memory, in some special limited memory area, or will it be uploaded to the GPU at each kernel call?

MicahVillmow · ‎04-22-2010

Raistmer,
Because of how the GPU works, all functions are inlined before ISA is generated. However, I would recommend using functions when possible as the compiler does not inline until it absolutely has to, resulting in better compilation times and possibly better code.

The kernel must be uploaded to the GPU when it is executed as there is no resource sharing between Compute and Graphics. When one of the modes takes control of the GPU, it takes control of the whole device.

_Big_Mac_ · ‎04-22-2010

I can only tell you how it works on NVIDIA cards. Their compiler usually (by default) inlines device functions, although the GPUs can jump/call; you can force no-inline if you want. It's possible that the compiler has a heuristic not to inline really huge functions so that you don't bust the I-cache, but I haven't checked.

Code is cached in the L1 I-cache, whose size is not documented but can easily be found by microbenchmarking with loop unrolling (IIRC 8KB, not sure now). When your code is bigger than that, speed gradually drops due to, I suppose, additional global memory fetches. I don't know if the code physically resides in global memory or somewhere else, but I suspect gmem, or rather some invisible partition of it. I'm not sure whether the code stays on the GPU between kernel launches; I suspect it does, but I can't back that up in any way.

Raistmer · ‎04-22-2010

Thanks for the answers!
I ask because I have a general kernel that handles all cases, but I could use many slightly different kernels to speed up particular input array sizes. Because the total size of those special kernels can be considerably larger than the size of the single general kernel, I would like to know whether such an optimization would be counter-productive.
I understand that this will surely increase startup time because of the extra work for the compiler, but in my case the app can run for many hours, so startup time should not dominate.

@Micah
I don't quite understand the Compute and Graphics modes.
If I allocate a few hundred megabytes on the GPU card for data, will it all be backed up to host memory on each mode switch? That would be a complete performance killer.

empty_knapsack · ‎04-23-2010

I tested ATI GPUs some time ago and found that the code cache size is only 48K. With an "unroll everything" strategy it's too easy to exceed this value, especially with a VLIW architecture. And for RV770, a kernel size > 48K is a performance killer; you can check out my original post.

Raistmer · ‎04-23-2010

Very interesting, thanks!
And how did you measure the size of the binaries?
For example, I have a Brook kernel and an OpenCL kernel. How can I tell what the binary size will be? (Which generated file should I look at?)

empty_knapsack · ‎04-23-2010

I'm simply parsing the resulting ISA for "CodeLen=%d" string.
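A minimal sketch of that kind of check in Python, assuming the ISA dump is plain text containing a line such as `CodeLen = 76336;Bytes` (the form quoted further down in this thread). The function name `code_len_bytes` and the sample text are made up for illustration; this is not necessarily the script actually used.

```python
import re

# Matches "CodeLen=123" as well as "CodeLen = 123;Bytes".
CODELEN_RE = re.compile(r"CodeLen\s*=\s*(\d+)")

def code_len_bytes(isa_text):
    """Return the first CodeLen value found in an ISA dump, or None if absent."""
    match = CODELEN_RE.search(isa_text)
    return int(match.group(1)) if match else None

# Hypothetical excerpt of an ISA dump; only the CodeLen line matters here.
sample = "; ---- disassembly ----\nCodeLen = 76336;Bytes\n"
print(code_len_bytes(sample))
```

Returning `None` when no match is found makes it easy to spot dumps that do not carry the field at all.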

Raistmer · ‎04-23-2010

Oh

For the OpenCL kernel I'm clearly out of range...
CodeLen = 76336;Bytes
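To put that figure next to the 48K code cache reported above for RV770, here is a quick arithmetic check. It assumes "48K" means 48 * 1024 bytes (the thread does not say which), and the helper name `cache_overflow` is illustrative only.

```python
# Assumption: "48K" is taken as 48 * 1024 bytes.
CODE_CACHE_BYTES = 48 * 1024

def cache_overflow(code_len, cache_bytes=CODE_CACHE_BYTES):
    """Return how many bytes of code do not fit in the cache (0 if it fits)."""
    return max(0, code_len - cache_bytes)

code_len = 76336  # CodeLen value reported for the OpenCL kernel
overflow = cache_overflow(code_len)
ratio = round(code_len / CODE_CACHE_BYTES, 2)
print(overflow, ratio)
```

Under that assumption the kernel is roughly one and a half times the reported cache size.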

And how do I obtain the ISA file for a Brook kernel?

genaganna · ‎04-24-2010

Originally posted by: Raistmer how to obtain isa file for Brook kernel?

Use SKA to get the ISA from a Brook+ kernel.

Raistmer · ‎04-24-2010

Could you be more specific, please?
How do I do that?
When I try "Export object" from Stream Kernel Analyser, it does absolutely nothing. No file is created with the specified name.
I use SKA 1.4; is it bugged too?

omkaranathan · ‎04-25-2010

Just copy and paste your code into the source code window, select the Brook+ compiler and GPU, and you'll get the ISA code in the right-hand window.

Raistmer · ‎04-25-2010

Originally posted by: omkaranathan

Just copy and paste your code into the source code window, select the Brook+ compiler and GPU, and you'll get the ISA code in the right-hand window.

If you follow this thread, you will see that I don't need the disassembled ISA code. I need the size of the binary for that code. Sorry, but I can't calculate the size of a binary from an assembly listing; that would require knowing the binary size of each instruction involved.
Any other suggestions, maybe?

How GPU handles many kernels calling same function?