marcoacf
Journeyman III

OpenCL program not using all available cores on an AMD 280X GPU

Hi, guys.

I'm developing a sequence comparison application with OpenCL 1.2 and testing on an AMD R9 280X GPU. Here is the video card information:

DRIVER_VERSION: 1445.5 (VM)
Type: GPU
EXECUTION_CAPABILITIES: Kernel
GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
SINGLE_FP_CONFIG: 0xbe
QUEUE_PROPERTIES: 0x2
VENDOR_ID: 4098
MAX_COMPUTE_UNITS: 32
MAX_WORK_ITEM_DIMENSIONS: 3
MAX_WORK_GROUP_SIZE: 256
PREFERRED_VECTOR_WIDTH_CHAR: 4
PREFERRED_VECTOR_WIDTH_SHORT: 2
PREFERRED_VECTOR_WIDTH_INT: 1
PREFERRED_VECTOR_WIDTH_LONG: 1
PREFERRED_VECTOR_WIDTH_FLOAT: 1
PREFERRED_VECTOR_WIDTH_DOUBLE: 1
MAX_CLOCK_FREQUENCY: 1020
ADDRESS_BITS: 32
MAX_MEM_ALLOC_SIZE: 1073741824
IMAGE_SUPPORT: 1
MAX_READ_IMAGE_ARGS: 128
MAX_WRITE_IMAGE_ARGS: 8
IMAGE2D_MAX_WIDTH: 16384
IMAGE2D_MAX_HEIGHT: 16384
IMAGE3D_MAX_WIDTH: 2048
IMAGE3D_MAX_HEIGHT: 2048
IMAGE3D_MAX_DEPTH: 2048
MAX_SAMPLERS: 16
MAX_PARAMETER_SIZE: 1024
MEM_BASE_ADDR_ALIGN: 2048
MIN_DATA_TYPE_ALIGN_SIZE: 128
GLOBAL_MEM_CACHELINE_SIZE: 64
GLOBAL_MEM_CACHE_SIZE: 16384
GLOBAL_MEM_SIZE: 2893021184
MAX_CONSTANT_BUFFER_SIZE: 65536
MAX_CONSTANT_ARGS: 8
LOCAL_MEM_SIZE: 32768
ERROR_CORRECTION_SUPPORT: 0
PROFILING_TIMER_RESOLUTION: 1
ENDIAN_LITTLE: 1
AVAILABLE: 1
COMPILER_AVAILABLE: 1
MAX_WORK_GROUP_SIZES: 256 256 256

The program is correct and produces the right results (it also runs on other CPUs and GPUs), but performance is very poor. It seems to me that OpenCL is not using all the available cores: the same code runs 50x faster on an Nvidia GTX 680 card.

The code is somewhat complex, so I'm posting just the host code, since the kernel code itself executes correctly.

  err  = clSetKernelArg(kernel2,  0, sizeof(i0), &i0);
  err |= clSetKernelArg(kernel2,  1, sizeof(i1), &i1);
  err |= clSetKernelArg(kernel2,  2, sizeof(step), &step);
  err |= clSetKernelArg(kernel2,  3, sizeof(cutBlock), &cutBlock);
  err |= clSetKernelArg(kernel2,  4, sizeof(cl_mem), (void*) &op->d_blockResult);
  err |= clSetKernelArg(kernel2,  5, sizeof(cl_mem), (void*) &op->d_busH);
  err |= clSetKernelArg(kernel2,  6, sizeof(cl_mem), (void*) &op->d_extraH);
  err |= clSetKernelArg(kernel2,  7, sizeof(cl_mem), (void*) &op->d_busV_h);
  err |= clSetKernelArg(kernel2,  8, sizeof(cl_mem), (void*) &op->d_busV_e);
  err |= clSetKernelArg(kernel2,  9, sizeof(cl_mem), (void*) &op->d_busV_o);
  err |= clSetKernelArg(kernel2, 10, sizeof(cl_mem), (void*) &op->d_split_m);
  err |= clSetKernelArg(kernel2, 11, sizeof(cl_mem), (void*) &op->t_seq0);
  err |= clSetKernelArg(kernel2, 12, sizeof(cl_mem), (void*) &op->t_seq1);

  if (err != CL_SUCCESS)
    exit(EXIT_FAILURE);

  global = blocks * threads;
  local = threads;

  /* Queried but currently unused; could be checked against 'local'. */
  err = clGetKernelWorkGroupInfo(kernel2, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(max_work_groups), &max_work_groups, NULL);
  if (err != CL_SUCCESS)
    exit(EXIT_FAILURE);

  err = clEnqueueNDRangeKernel(commands, kernel2, 1, NULL, &global, &local, 0, NULL, &k_event);
  if (err != CL_SUCCESS)
    exit(EXIT_FAILURE);

  err = clWaitForEvents(1, &k_event);
  err = clReleaseEvent(k_event);

In a typical execution, global=4096 and local=64, so the work size is greater than the number of GPU cores.

Is there any reason for OpenCL not to use all available cores? Could it be a driver bug (I'm using the most recent stable Catalyst driver)?

Thanks in advance.

Marco

1 Solution
ekondis
Adept II

4096 work-items in the NDRange is a performance-limiting factor by itself on a Tahiti GPU. Each CU needs at least 4 wavefronts in flight, one per SIMD unit, and a wavefront is 64 work-items, so you need at least 256 work-items per CU, which gives 32 x 256 = 8192 work-items in total. In your case you utilize only half of the compute elements in your GPU. If you also consider that you cannot hide any memory or pipeline latencies with such a work-item count, you can see why the performance is even worse than half of peak.

For a comparison of the minimum parallelism requirements across GPU architectures, have a look at:

http://parallelplusplus.blogspot.gr/2014/10/least-required-gpu-parallelism-for.html


7 Replies

yurtesen
Miniboss

In my opinion a global size of 4096 is very small to run efficiently. I usually had bad results with global sizes under 50k. In addition, I had register spills when I used complex kernels with AMD's OpenCL drivers. Are you using LDS? Did you try CodeXL to see in more detail exactly what is taking the time? (Also, I'm not sure I understand why you think it is not using all the cores.)

realhet
Miniboss

Hi,

4096 global work-items with a local size of 64 is only enough to occupy about 50% of your hardware.

Also, unless your kernel runs for roughly 100 ms or longer, the queueing of kernels could be an issue.

Try sending the card as many work-items as you can in one kernel launch, and make sure the work takes longer than 100 ms.

et1
Adept II

I'd suggest you profile your code with CodeXL. It shows what limits your kernel (registers, local memory, etc.), which I'm sure will help.

Ziple
Adept I

Have you tried local=128 and global=2048, for example? If I remember correctly, wavefronts on AMD GPUs execute 64 threads at a time (versus 32 on Nvidia GPUs), so maybe you are not benefiting from latency hiding.

(I don't know whether it really makes a difference; the hardware may be able to interleave work-items from different work-groups on the same compute cores, but if not, you should see some improvement.)

natasha
Elite

Try checking this with the CodeXL application. It provides a summary table of the kernel run; that information will probably help you.

marcoacf
Journeyman III

Guys,

Thanks a lot for your responses - they are driving me to the solution!

It really does seem that the work size is smaller than the GPU's capacity, which is why performance is so bad. I'll change the code and run some tests.

Best Regards
