5 Replies Latest reply on May 5, 2012 12:24 AM by realhet

    Questions about kernel instruction flow

    viscocoa

      Hi,

       

      I have a few questions about kernel instruction flow.

       

      (1) Is a kernel's binary code uploaded to GPU when clEnqueueNDRangeKernel() is executed?

      (2) Is it stored in global memory?

      (3) Are there special channels/caches for instruction flow to speed it up?

      (4) How long does a kernel stay on the GPU?

      (4) If I invoke a kernel repeatedly, will the binary code be uploaded via the system bus repeatedly?

      (5) FetchSize is an important figure summarized by the APP profiler, which shows the total kilobytes fetched from the video memory. Does FetchSize take into account the instruction flow? Or does it only reflect data flow?

       

      Thank you very much in advance.

       

      Vis Cocoa

        • Re: Questions about kernel instruction flow
          nou

          There is a cache for kernel instructions. It holds 48~64 kB (I can't remember the exact value). If you exceed this size, instructions are loaded from global memory and you will see a slowdown. FetchSize, AFAIK, accounts only for data.

          • Re: Questions about kernel instruction flow
            realhet

            Hi!

            (3) The VLIW architectures have 48KB of instruction cache. These are the HD4xxx .. HD6xxx cards. On the HD7xxx they reduced the instruction cache to 32KB, but that is not a problem because of the better instruction encoding. Usually the same amount of code can be executed on both. A VLIW instruction is exactly 64 bits, while a GCN instruction can be either 32 or 64 bits. Most of the 2-operand instructions can be encoded in 32 bits. Also, simple constants use 0 bytes of extra code, because they fit in the register encoding. Those are the integers -16..64 and the floats +-0.5, +-1.0, +-2.0 and +-4.0.
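To put rough numbers on "the same amount of code can be executed on both", here is a small Python sketch. It only uses the cache sizes and instruction widths quoted in the post above; the names `ICACHE_BYTES` and `max_instructions` are made up for illustration, and it assumes 1 KB = 1024 bytes and that nothing else shares the cache.

```python
# Instruction cache sizes as quoted in the post (assumption: 1 KB = 1024 bytes).
ICACHE_BYTES = {
    "vliw": 48 * 1024,  # HD4xxx .. HD6xxx
    "gcn": 32 * 1024,   # HD7xxx
}


def max_instructions(arch, bytes_per_instruction):
    """Upper bound on how many instructions of a given size fit in the cache."""
    return ICACHE_BYTES[arch] // bytes_per_instruction


if __name__ == "__main__":
    print(max_instructions("vliw", 8))  # 64-bit VLIW instructions -> 6144
    print(max_instructions("gcn", 4))   # all 32-bit GCN encodings -> 8192
    print(max_instructions("gcn", 8))   # all 64-bit GCN encodings -> 4096
```

Real kernels on GCN mix both encodings, so the practical limit falls somewhere between the last two figures, which is roughly in line with the VLIW number.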

            Also, there is a 32KB constant cache (the scalar cache on GCN). If you run out of instruction cache, you can save some dwords by putting all the literal constants into a read-only constant buffer.
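As a rough illustration of how many dwords that could save, the sketch below checks constants against the "free" list given earlier in this reply (integers -16..64, floats +-0.5, +-1.0, +-2.0, +-4.0). It encodes only that list, so other hardware special cases are not covered. The function names are hypothetical, and the 4 bytes per literal is an assumption (one extra dword per non-inline constant).

```python
# Float magnitudes that the post lists as fitting in the register encoding.
INLINE_FLOAT_MAGNITUDES = {0.5, 1.0, 2.0, 4.0}


def is_inline_constant(value):
    """True if the constant is on the 'free' list quoted in the post."""
    if isinstance(value, int):
        return -16 <= value <= 64
    return abs(value) in INLINE_FLOAT_MAGNITUDES


def literal_overhead_bytes(constants, literal_size=4):
    """Extra code bytes for constants that need a literal.

    Assumption: each non-inline constant costs one extra dword (4 bytes).
    """
    return sum(literal_size for c in constants if not is_inline_constant(c))


if __name__ == "__main__":
    # 1 and 2.0 are inline; 100 and 3.14 each need a literal dword.
    print(literal_overhead_bytes([1, 100, 2.0, 3.14]))  # 8
```

Whatever this returns for your kernel's constants is about what you would get back by moving them into a constant buffer.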

             

            "(4) If I invoke a kernel repeatedly, will the binary code be uploaded via the system bus repeatedly?" No, I think the binary code is only transferred once into the GPU memory, when you create that program object (clBuildProgram or something like that).

            But anyway, code size is a rather small thing compared to PCIe bandwidth.
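A back-of-the-envelope check of that last point, as a Python sketch. The 8 GB/s figure is an assumption (roughly the theoretical peak of a PCIe 2.0 x16 link in one direction), not a measurement, and real transfers add latency and driver overhead on top. `transfer_time_us` is a hypothetical helper name.

```python
def transfer_time_us(num_bytes, bandwidth_gb_per_s):
    """Ideal transfer time in microseconds, ignoring latency and overhead."""
    seconds = num_bytes / (bandwidth_gb_per_s * 1e9)
    return seconds * 1e6


if __name__ == "__main__":
    # A kernel that fills the whole 48KB VLIW instruction cache,
    # over an assumed 8 GB/s link.
    print(f"{transfer_time_us(48 * 1024, 8.0):.3f} us")
```

Even a kernel as large as the whole instruction cache comes out at a few microseconds under this assumption, so a one-time upload should not matter much.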
