7 Replies Latest reply on Jul 28, 2011 12:19 PM by himanshu.gautam

    CPU and APU optimizations

    pratapk
      CPU and APU memory optimizations

      For OpenCL targeted at an APU:

      1) In an APU, the graphics core shares main memory (instead of having dedicated VRAM).

      Is it really required to do buffer copies and to use local and global memory (except for synchronization)?

      Can't we just use host_ptr and mapped memory, since it all resides in main memory?

      On an APU, which memory backs the 32 KB local memory (is there something like a cache)?

       

      2) Is there a sample for the APU?

       

      For OpenCL targeted at the CPU:

      1) I thought the CPU workgroup size should be on the order of 1, since CPU cores are not stream processors (I am talking about the warp size).

      But when I query the maximum workgroup size with clGetDeviceInfo, it returns 1024.

      What is the best-practice workgroup size for the CPU (similar to 64 for AMD GPUs)?

       

        • CPU and APU optimizations
          himanshu.gautam

          1) In an APU, the graphics core shares main memory (instead of having dedicated VRAM).

          In an APU, the CPU and GPU share the same RAM, although I am not sure about the sharing policy they use.

           

          Is it really required to do buffer copies and to use local and global memory (except for synchronization)?

          Can't we just use host_ptr and mapped memory, since it all resides in main memory?

          That depends on the policy used for sharing the RAM. If the RAM regions for the GPU and CPU are defined exclusively, you will have to do buffer copies from RAM to RAM.


          On an APU, which memory backs the 32 KB local memory (is there something like a cache)?

          The CPU and GPU each have their own dedicated caches.

           

          2) Is there a sample for the APU?

          Do you have an APU with you? AFAIK, the current SDK samples should run on an APU. Are you facing any issues?


          For OpenCL targeted at the CPU:

          1) I thought the CPU workgroup size should be on the order of 1, since CPU cores are not stream processors (I am talking about the warp size).

          But when I query the maximum workgroup size with clGetDeviceInfo, it returns 1024.

          What is the best-practice workgroup size for the CPU (similar to 64 for AMD GPUs)?

          As I see it, 64 is not a hard rule. It normally works better when kernels are compute-heavy. If kernels are fetch- or write-bound, it is better to assign more work-items to each workgroup. 1024 is the maximum number of work-items supported in a single workgroup on the CPU; this value is 256 for most GPUs.


            • CPU and APU optimizations
              pratapk

              Quoted:"That depends on the policy used for sharing the RAM. If the RAM regions for the GPU and CPU are defined exclusively, you will have to do buffer copies from RAM to RAM."

              Do you know the policy for the APU? Instead of copying buffers from RAM to RAM, can't we operate on the same set of data?

               

              Quoted:"Do you have an APU with you? AFAIK, the current SDK samples should run on an APU. Are you facing any issues?"

              The current examples are optimized (maybe targeted) for CPU and discrete GPU combinations; I didn't really find an example optimized for an APU.

               

              Quoted:"What is the best-practice workgroup size for the CPU (similar to 64 for AMD GPUs)?

              As I see it, 64 is not a hard rule. It normally works better when kernels are compute-heavy. If kernels are fetch- or write-bound, it is better to assign more work-items to each workgroup. 1024 is the maximum number of work-items supported in a single workgroup on the CPU; this value is 256 for most GPUs."

              Can you be specific about the CPU?

            • CPU and APU optimizations
              maximmoroz

              > Is it really required to do buffer copies and to use local and global memory (except for synchronization)?

              1. You are better off using clEnqueueMapBuffer instead of clEnqueueReadBuffer/clEnqueueWriteBuffer. It is fast on APUs, as no actual copying occurs.

              2. The integrated GPU has dedicated local memory, so if the algorithm benefits from using local memory, it is better to use it.
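              A minimal host-side sketch of that map-instead-of-copy pattern (illustrative only: it assumes `context`, `queue`, and `err` are already set up, the buffer size and flags are arbitrary, and running it requires an OpenCL runtime with an APU device):

```c
/* Sketch: zero-copy style access via clEnqueueMapBuffer.
 * Assumes cl_context context, cl_command_queue queue, cl_int err exist. */
size_t size = 1024 * sizeof(float);

/* CL_MEM_ALLOC_HOST_PTR asks the runtime for host-visible memory,
 * which the APU's GPU can reach without a separate staging copy. */
cl_mem buf = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

/* Map instead of clEnqueueWriteBuffer: on an APU this typically
 * returns a pointer into the same physical RAM, with no copy. */
float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                         0, size, 0, NULL, NULL, &err);
for (size_t i = 0; i < 1024; ++i)
    ptr[i] = (float)i;                  /* fill the data in place */

/* Unmap before any kernel uses the buffer. */
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
```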

              • CPU and APU optimizations
                maximmoroz

                > I thought the CPU workgroup size should be on the order of 1, since CPU cores are not stream processors (I am talking about the warp size)

                Actually, the fact that the AMD APP SDK is not able to auto-vectorize kernels when compiling for the CPU causes the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE parameter to be 1. Generally, the workgroup size should be greater than the warp/wavefront size, and it usually is.

                > What is the best-practice workgroup size for the CPU?

                Set it to 64. In most cases such a local work size will be close to the most efficient one.