Questions about kernel instruction flow


I have a few questions about kernel instruction flow.

(1) Is a kernel's binary code uploaded to GPU when clEnqueueNDRangeKernel() is executed?

(2) Is it stored in global memory?

(3) Are there special channels/caches for instruction flow to speed it up?

(4) How long does a kernel stay on the GPU?

(4) If I invoke a kernel repeatedly, will the binary code be uploaded via the system bus repeatedly?

(5) FetchSize is an important figure summarized by the APP profiler, which shows the total kilobytes fetched from the video memory. Does FetchSize take into account the instruction flow? Or does it only reflect data flow?

Thank you very much in advance.

Vis Cocoa

5 Replies

There is a cache for kernel instructions. It is 48~64 kB (I can't remember the exact value). If your kernel exceeds this size, instructions are loaded from global memory and you will see a slowdown. FetchSize, AFAIK, accounts only for data.

Thank you very much nou. You provided very useful information.



(3) The VLIW architectures have 48KB of instruction cache. These are the HD4xxx through HD6xxx cards. On the HD7xxx (GCN) the instruction cache was reduced to 32KB, but that's not a problem because of the more compact instruction encoding: usually the same amount of code can be executed on both. A VLIW instruction is exactly 64 bits, while a GCN instruction can be either 32 bits or 64 bits. Most 2-operand instructions can be encoded in 32 bits. Simple constants also take 0 bytes of extra code, because they fit in the register encoding. Those are -16..64, +-0.5, +-1.0, +-2.0 and +-4.0.
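To put rough numbers on that comparison, here is a minimal sketch that takes the cache and instruction sizes quoted above at face value and works out how many instructions fit. The function name and constants are only for illustration, and real kernels mix 32-bit and 64-bit encodings, so the GCN figure lands somewhere between the two bounds.

```python
# Sizes as quoted in the post above (taken at face value, not measured).
VLIW_ICACHE_BYTES = 48 * 1024   # HD4xxx .. HD6xxx
GCN_ICACHE_BYTES = 32 * 1024    # HD7xxx

def max_instructions(cache_bytes, bytes_per_instruction):
    """How many fixed-size instructions fit in a cache of the given size."""
    return cache_bytes // bytes_per_instruction

print(max_instructions(VLIW_ICACHE_BYTES, 8))  # VLIW, 64-bit only: 6144
print(max_instructions(GCN_ICACHE_BYTES, 8))   # GCN, all 64-bit:   4096
print(max_instructions(GCN_ICACHE_BYTES, 4))   # GCN, all 32-bit:   8192
```

So if most instructions use the 32-bit encoding, the smaller GCN cache can hold about as many instructions as the larger VLIW one, or more.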

There is also a 32KB constant cache (the scalar cache on GCN). If you run out of instruction cache, you can save some dwords by putting all the literal constants into a read-only constant buffer.
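As a rough illustration of how many dwords that could save, the sketch below counts which constants in a list fall outside the inline set listed above. It assumes each non-inline constant costs one extra 32-bit literal dword in the code stream; that cost model and the helper names are assumptions for the example, not something taken from the ISA documentation.

```python
# Inline constants as listed in the post: integers -16..64 and
# +-0.5, +-1.0, +-2.0, +-4.0. Charging every other constant one extra
# 32-bit literal dword is an assumption made for this illustration.
INLINE_FLOATS = {0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0}
LITERAL_BYTES = 4

def is_inline_constant(value):
    """True if the value is in the inline set quoted in the post."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return -16 <= value <= 64
    return value in INLINE_FLOATS

def literal_bytes(constants):
    """Extra code bytes spent on constants that are not inline."""
    return sum(LITERAL_BYTES for c in constants if not is_inline_constant(c))

example = [0, 1, 64, 65, -16, -17, 0.5, 3.14159, 2.0, 1000]
print(literal_bytes(example))  # 65, -17, 3.14159 and 1000 need literals: 16
```

Under that assumption, moving those four values into a constant buffer would take 16 bytes out of the instruction stream.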

"(4) If I invoke a kernel repeatedly, will the binary code be uploaded via the system bus repeatedly?" No, I think the binary code is only transferred once into GPU memory, when you create that program object (clBuild or something like that).

But anyway, code size is rather small compared to PCIe bandwidth.
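A back-of-the-envelope check of that claim, as a sketch only: it assumes a PCIe 2.0 x16 link at its theoretical peak of about 8 GB/s and ignores transfer latency and driver overhead, so real uploads will take longer. The function name is made up for the example.

```python
# Assumed link speed: PCIe 2.0 x16 theoretical peak, about 8 GB/s.
# Latency and driver overhead are ignored in this estimate.
PCIE2_X16_BYTES_PER_S = 8e9

def upload_time_us(code_bytes, bandwidth=PCIE2_X16_BYTES_PER_S):
    """Idealized time in microseconds to move code_bytes over the bus."""
    return code_bytes / bandwidth * 1e6

# A kernel that fills a whole 48KB instruction cache:
print(upload_time_us(48 * 1024))  # roughly 6 microseconds
```

Even a kernel as large as the whole instruction cache would cost only a few microseconds of raw bus time under these assumptions.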

Hi realhet,

Thank you very much for the detailed information. If the binary code is transferred when clBuildProgram() is called, then a kernel will take some video memory even if it is never invoked. Is that true?

Vis Cocoa


I really don't know... But I guess it's done lazily, like so much else in CAL/OpenCL. So when it comes to executing your kernel, the system will first ensure that all prerequisites, such as the resources and the program, are present in GPU RAM. But I'm just speculating...