

FangQ
Adept I

OpenCL performance on multicore CPU

Hi,

I just got my first OpenCL code working. There are still a lot of things that need to be fine-tuned and digested. One of them is the CPU load when running the code on a multicore CPU.

My computer has an Intel quad-core (Q6700) CPU and a Radeon 4650 card. I first called clGetPlatformIDs(), and it returned 1 platform, called "ATI Stream". Then I used clCreateContextFromType() to create a CPU context from this platform. Calling clGetContextInfo() returned 4 devices, which I assume are the 4 cores of the CPU. Then I created a command queue for devices[0], thinking that this would attach a queue to the first core of the CPU. However, when I launched my kernel on this command queue, I saw my CPU load jump to 400%, indicating that all cores were in use.

Can anyone explain to me what happened? Would you expect the call

commands=clCreateCommandQueue(context,devices[0] ... )

to limit all subsequent computation to a single core of the CPU? Or is the Stream SDK smart enough to expand it to all available devices within this context?

 

In addition, my card is supposed to have 320 cores, but when I ran CLInfo, it showed only 8 compute units. Is this right? (Running my code on the GPU was a lot slower than on the CPU.)

0 Likes
4 Replies
nou
Exemplar

How do you know that clGetContextInfo() returned 4 devices? If you mean the returned size, that is in bytes, not a count, so you must divide the value returned from clGetContextInfo() by sizeof(cl_device_id), which is the same as sizeof(size_t) on typical systems. I presume you use a 32-bit system, so 4/sizeof(cl_device_id) = 1.

That is the correct value, because OpenCL treats the CPU as one device with 4 cores.

clGetDeviceIDs() returns a count, not a size in bytes.
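
To make the bytes-versus-count point concrete, here is a minimal sketch in C. The helper name `device_count_from_bytes` and its `handle_size` parameter are hypothetical, introduced only for illustration; `handle_size` stands in for `sizeof(cl_device_id)` so that the sketch compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert the byte size reported by a query such as
 *   clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &size_in_bytes)
 * into a number of device handles. handle_size stands in for
 * sizeof(cl_device_id). */
static size_t device_count_from_bytes(size_t size_in_bytes, size_t handle_size)
{
    assert(handle_size > 0);                  /* avoid division by zero */
    assert(size_in_bytes % handle_size == 0); /* size must be whole handles */
    return size_in_bytes / handle_size;
}
```

With a 4-byte handle (a 32-bit build), a reported size of 4 bytes corresponds to one device, which matches a quad-core CPU being exposed as a single device with 4 compute units.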

The GPU has 8 cores (the compute units that CLInfo reports). Each core contains 8 VLIW units, and each VLIW unit is 5 units wide, so 8*8*5 = 320.

0 Likes

Originally posted by: nou

How do you know that clGetContextInfo() returned 4 devices? If you mean the returned size, that is in bytes, not a count, so you must divide the value returned from clGetContextInfo() by sizeof(cl_device_id), which is the same as sizeof(size_t) on typical systems. I presume you use a 32-bit system, so 4/sizeof(cl_device_id) = 1.

That is the correct value, because OpenCL treats the CPU as one device with 4 cores.

clGetDeviceIDs() returns a count, not a size in bytes.



I see.

In OpenCL, is there a way to specify just one core? I am trying to run some tests with various numbers of cores and benchmark the performance of the code with respect to the number of cores.

 

 

0 Likes

Originally posted by: nou

The GPU has 8 cores. Each core contains 8 VLIW units, and each VLIW unit is 5 units wide, so 8*8*5 = 320.


I am curious whether there is a general way to estimate the speed-up of a code on an ATI card given its performance on an NVIDIA card (assuming no atomic operations, all floating point).

My code was originally written in CUDA, and achieved a >300x speed-up on an 8800GT card (112 NVIDIA cores, 14 multiprocessors, 1792 threads with 128 thread blocks) compared to a 64-bit Xeon CPU. I am wondering what kind of speed-up I should expect with this OpenCL port and the 4650 card (I also ordered a 4890OC a few days ago).

0 Likes

You should order a 58xx card. The 4xxx series has some performance restrictions, for example local memory that is emulated in global memory.

You can set the environment variable CPU_MAX_COMPUTE_UNITS=2 and the CPU device will use only 2 cores.
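
The most direct way to use this is to set the variable in the shell before launching the program. For a benchmark that sets the limit itself, the following is a minimal sketch in C. The helper name `limit_cpu_compute_units` is hypothetical, and the sketch assumes that the runtime reads the variable when it initializes, so the helper should be called before the first OpenCL call. It also assumes a POSIX system that provides `setenv()`.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: ask the CPU device to use at most `cores` compute
 * units by setting CPU_MAX_COMPUTE_UNITS for the current process.
 * Assumption: the runtime reads the variable at initialization, so call
 * this before any OpenCL function. Returns 0 on success, -1 on error. */
static int limit_cpu_compute_units(int cores)
{
    char value[16];
    int written;

    if (cores < 1)
        return -1;

    written = snprintf(value, sizeof value, "%d", cores);
    assert(written > 0 && (size_t)written < sizeof value);

    return setenv("CPU_MAX_COMPUTE_UNITS", value, 1); /* 1 = overwrite */
}
```

Because the limit is assumed to apply per process, a scaling test would run the program once per core count (1, 2, 3, 4) rather than changing the value in the middle of a run.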

0 Likes