8 Replies Latest reply on May 6, 2010 8:57 PM by Raistmer

    Private array spilled in global memory?

    Raistmer
      In my kernel I use:
      float4 d[128];

      Assembly for RV770 gives:
      01 MEM_SCRATCH_WRITE: VEC_PTR[0], R0, ARRAY_SIZE(128) ELEM_SIZE(3)
      ....
      128 MEM_SCRATCH_WRITE: VEC_PTR[127], R0, ARRAY_SIZE(128) ELEM_SIZE(3)
      129 MEM_SCRATCH_WRITE_ACK: VEC_PTR[128], R0, ARRAY_SIZE(128) ELEM_SIZE(3)

      What does this mean? Was the whole array spilled into global memory?
      I'm trying to cache data in registers to avoid global memory accesses, so spilling registers to memory is not an option at all...
        • Private array spilled in global memory?
          MicahVillmow
          Raistmer,
          The default setting of OpenCL is to run 4 wavefronts per group. This can be overridden with the kernel attribute __attribute__((reqd_work_group_size(X, Y, Z))). So by default, if you use more than ~60 registers, you will spill to memory.
          Also, private arrays will be pushed into registers only if the indexing pattern is simple or the size is small. Otherwise we use the hardware scratch mechanism, which is stored in global memory.

          Use arrays only as a last resort; they are not fast.
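
The register arithmetic behind the "~60 registers" figure can be sketched in a few lines of C. This is a rough illustration only: the 256-register total is an assumption that is not stated in the thread, and the helper names `gpr_budget` and `fits_in_registers` are hypothetical. Fitting the budget is necessary but not sufficient, since the compiler still decides by array size and indexing pattern.

```c
#include <assert.h>

/* Assumption (not stated in the thread): 256 vec4 GPRs are available to
   each work-item when a single wavefront occupies a SIMD. */
enum { ASSUMED_GPRS_ONE_WAVEFRONT = 256 };

/* GPR budget per work-item when `wavefronts` wavefronts must be resident
   at the same time and share the register file equally. */
static int gpr_budget(int wavefronts)
{
    assert(wavefronts > 0);
    return ASSUMED_GPRS_ONE_WAVEFRONT / wavefronts;
}

/* Returns 1 if `needed` vec4 registers fit in the per-work-item budget,
   0 otherwise. */
static int fits_in_registers(int needed, int wavefronts)
{
    return needed <= gpr_budget(wavefronts);
}
```

Under that assumption, `gpr_budget(4)` gives 64, close to the ~60 quoted above once a few registers go to other uses. A `float4 d[128]` needs 128 vec4 registers, so `fits_in_registers(128, 4)` is false for the default group size, while `fits_in_registers(128, 1)` is true for a single wavefront.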
          • Private array spilled in global memory?
            MicahVillmow
            Raistmer,
            If you limit the work group size via the above attribute to 64 work-items, this might not spill into memory.
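
A minimal sketch of how the attribute might be applied, assuming a hypothetical kernel `sum4` that keeps a small private array with a fixed, fully known indexing pattern. The `#ifndef __OPENCL_VERSION__` branch holds hypothetical host-side shims (`REQD_WG_SIZE`, `host_gid`, a struct `float4`) so that the same text also compiles as plain C for testing. Whether a given compiler actually keeps `d` in registers is not guaranteed.

```c
/* Host fallback so this sketch also compiles as plain C (hypothetical shims). */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define REQD_WG_SIZE(x, y, z)            /* the attribute is OpenCL-only */
typedef struct { float x, y, z, w; } float4;
static size_t host_gid = 0;              /* stand-in for the work-item id */
static size_t get_global_id(unsigned int dim) { (void)dim; return host_gid; }
#else
#define REQD_WG_SIZE(x, y, z) __attribute__((reqd_work_group_size(x, y, z)))
#endif

/* Limit the group to 64 work-items (one wavefront), as suggested above. */
__kernel REQD_WG_SIZE(64, 1, 1)
void sum4(__global const float4 *in, __global float4 *out)
{
    size_t gid = get_global_id(0);
    float4 d[4];                         /* small array, simple indexing */
    float4 s;
    int i;

    /* Fixed trip count: every index is known once the loop is unrolled. */
    for (i = 0; i < 4; ++i)
        d[i] = in[4 * gid + (size_t)i];

    s = d[0];
    for (i = 1; i < 4; ++i) {
        s.x += d[i].x;
        s.y += d[i].y;
        s.z += d[i].z;
        s.w += d[i].w;
    }
    out[gid] = s;
}
```

With this attribute the host has to enqueue the kernel with a local work size of exactly 64x1x1; any other local size is rejected at enqueue time.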
            • Private array spilled in global memory?
              MicahVillmow
              Raistmer,
              The problem is just that the array is too large, and it fails a heuristic check that our CAL compiler uses to decide whether it should attempt to use registers or not.
                • Private array spilled in global memory?
                  Raistmer
                  Originally posted by: MicahVillmow

                  Raistmer,

                  The problem is just that the array is too large, and it fails a heuristic check that our CAL compiler uses to decide whether it should attempt to use registers or not.


                  That is, I should use 128 different variable names to take advantage of so many possible registers per work-item? A pity, indeed.
                  Maybe some compiler switches could override the default compiler behavior when needed? (Look at NV's compiler: it has an option to limit/set the number of registers per thread.)
                  In general, a compiler can't be clever enough to cover all possible cases, right? Allowing some manual control over its decisions can be very useful.
                  Surely my case is not very suitable for a GPU, but it could be handled much better with the hardware already available.