3 Replies Latest reply on Oct 12, 2009 3:13 PM by alexaverbuch

    OpenCL caches

    bubu

      I have several questions about the caches in ATI's OpenCL implementation. Could anybody clarify these for me, please?

       

      1. Texture cache. Each time I sample a texture, I suppose the returned value is cached. What size is the cache? 4 KB to 6 KB? How well does it perform if all the work-items access the same data? Do I need to coalesce these accesses?

       

      2. Local cache. Is local memory cached? Does it need coalescing? What size is it? 16 KB?

       

      3. Constant cache. More of the same: what size is it? 64 KB? Does it need coalescing? How well does the cache perform if all the work-items access the same data? And how well if each work-item accesses unaligned data?

       

      4. Global cache. Is global memory cached? What size is the cache?

       

      I think you should clarify all this with nice visual graphs like it's done in the CUDA documents... with coloured blocks, arrows, etc... And put a table with the sizes, optimal alignments, etc...

       

      One more question: is it possible to move data from global memory to the constant cache inside a kernel, or can that only be done with a clXXXXXX host function before the kernel is executed?

       

      thx.

       

        • OpenCL caches
          n0thing

          ATI's current OpenCL implementation is CPU-only, and caching from main memory is automatic, so I guess everything is cached except texture buffers, since textures are not supported in the CPU implementation (they require fixed-function logic such as texture units and samplers).

          For the GPU implementation here are my predictions:

          1. Texture cache: there is one texture cache per SIMD unit, 8 KB I think (on RV770). Texture caches are optimized for spatial coherence in texture fetches, so you don't need to coalesce; it is done automatically by the tiled rasterization order of textures (a quad of texels is fetched at a time).

          2. Local memory (LDS) on RV770 is 16 KB per SIMD unit (R800 should have 32 KB, as it should support DX11). This memory is configured as 4 banks, each with 256 entries of 16 bytes, so you can read up to 4 aligned 32-bit words in one read access from the LDS. Writes have no bank conflicts because each thread can only write to its own private location; hence the LDS is not as general as the shared memory (local memory, in OpenCL terms) described by the OpenCL specification. R800 should support OpenCL's shared memory.

          3. Constant cache is 64 KB; I have no idea about coalescing.

          4. The OpenCL specification says: reads and writes to global memory may be cached depending on the capabilities of the device.
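
As a quick sanity check on the LDS figures in point 2, here is a minimal C sketch that multiplies out the quoted layout and compares it with the quoted capacity. The bank count, entry count, and entry size are the numbers from the post (predictions, not confirmed hardware specifications), and the function names `lds_total_bytes` and `words_per_entry` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Size of one 32-bit word in bytes. */
#define WORD_BYTES 4

/* Total LDS capacity in bytes implied by a layout of
   `banks` banks, each holding `entries` entries of `entry_bytes` bytes.
   The values plugged in below come from the post, not from a datasheet. */
size_t lds_total_bytes(size_t banks, size_t entries, size_t entry_bytes)
{
    return banks * entries * entry_bytes;
}

/* Number of aligned 32-bit words that fit in a single LDS entry,
   i.e. the most words one read access could return under this layout. */
size_t words_per_entry(size_t entry_bytes)
{
    assert(entry_bytes % WORD_BYTES == 0); /* entries hold whole words */
    return entry_bytes / WORD_BYTES;
}
```

With the post's numbers, `lds_total_bytes(4, 256, 16)` gives 16384 bytes, which is the 16 KB per SIMD unit quoted above, and `words_per_entry(16)` gives 4, which matches the claim of up to 4 aligned 32-bit words per read.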

          Here is what the OpenCL specification says about the constant address space:

          The __constant or constant address space name is used to describe variables allocated in global memory and which are accessed inside a kernel(s) as read-only variables. These read-only variables can be accessed by all (global) work-items of the kernel during its execution. This qualifier can be used with arguments to functions (including __kernel functions) that are declared as pointers, or with local variables inside a function declared as pointers, or with global variables. Global variables declared in the program source with the __constant qualifier are required to be initialized.

            • OpenCL caches
              omkaranathan

              These caches depend on the hardware, not on the OpenCL implementation.

              You can get the cache sizes using the clGetDeviceInfo() API call.

              OpenCL CPU implementation:

              Texture cache: textures (images) are not supported in the CPU implementation.

              Local cache: there is no local cache, only local memory.

              Constant cache, global cache: dependent on the CPU.


              The GPU implementation is not out yet.

            • OpenCL caches
              alexaverbuch

              I second this

               

              Originally posted by: bubu

              I think you should clarify all this with nice visual graphs like it's done in the CUDA documents... with coloured blocks, arrows, etc... And put a table with the sizes, optimal alignments, etc...