4 Replies Latest reply on Feb 7, 2010 5:49 AM by nou

    OpenCL performance on multicore CPU

    FangQ

      hi

      I just got my first OpenCL code working. There is still a lot that needs to be fine-tuned and digested. One of those things is the CPU load when running the code on a multicore CPU.

      My computer has an Intel quad-core (Q6700) CPU and a Radeon 4650 card. I first called clGetPlatformIDs(), which returned 1 platform, called "ATI Stream". Then I used clCreateContextFromType() to create a CPU context from this platform. Calling clGetContextInfo() returned 4 devices, which I assume are the 4 cores of the CPU. Then I created a command queue for devices[0], which I thought attached a queue to the first core of the CPU. However, when I launched my kernel on this command queue, I saw my CPU load jump to 400%, indicating that all cores were in use.

      Can anyone explain to me what happened? Would you expect the call

      [code]commands=clCreateCommandQueue(context,devices[0] ... )[/code]

      to limit all subsequent computation to a single core of the CPU? Or is the Stream SDK smart enough to spread the work across all available devices within this context?

      In addition, my card is supposed to have 320 cores, but when I ran CLInfo, it showed only 8 compute units. Is this right? (Running my code on the GPU was a lot slower than on the CPU.)

        • OpenCL performance on multicore CPU
          nou

          How do you know that clGetContextInfo() returned 4 devices? If you mean the returned size, that is in bytes, not a count, so you must divide the value returned by clGetContextInfo() by sizeof(size_t). I presume you use a 32-bit system, so 4/sizeof(size_t) = 1.

          That is the correct value, because OpenCL treats the CPU as one device with 4 cores.

          clGetDeviceIDs() returns a count, not a size in bytes.

          The GPU has 8 cores. Each core contains 8 VLIW units, each of which is 5 units wide, so 8*8*5 = 320.

            • OpenCL performance on multicore CPU
              FangQ

               

              Originally posted by: nou
              How do you know that clGetContextInfo() returned 4 devices? If you mean the returned size, that is in bytes, not a count, so you must divide the value returned by clGetContextInfo() by sizeof(size_t). I presume you use a 32-bit system, so 4/sizeof(size_t) = 1.

              That is the correct value, because OpenCL treats the CPU as one device with 4 cores.

              clGetDeviceIDs() returns a count, not a size in bytes.



              I see.

              In OpenCL, is there a way to specify just one core? I am trying to run some tests with various numbers of cores and benchmark how the code's performance scales with the core count.

              • OpenCL performance on multicore CPU
                FangQ

                 

                Originally posted by: nou
                The GPU has 8 cores. Each core contains 8 VLIW units, each of which is 5 units wide, so 8*8*5 = 320.


                I am curious whether there is a general way to estimate the speed-up of a code on an ATI card given its performance on an nVidia card (assuming no atomic operations, all floating point).

                My code was originally written in CUDA and achieved a >300x speed-up on an 8800GT card (112 nVidia cores, 14 MP, 1792 threads with 128 thread blocks) compared to a 64-bit Xeon CPU. I am wondering what kind of speed-up I could expect from this OpenCL port on the 4650 card (I also ordered a 4890OC a few days ago).