Archives Discussions

marcoacf · ‎01-23-2015

Hi, guys.

I'm developing a sequence comparison application on OpenCL 1.2, and testing in a AMD R9 280X GPU. Here is the video card information:

DRIVER_VERSION: 1445.5 (VM) Type: GPU EXECUTION_CAPABILITIES: Kernel GLOBAL_MEM_CACHE_TYPE: Read-Write (2) CL_DEVICE_LOCAL_MEM_TYPE: Local (1) SINGLE_FP_CONFIG: 0xbe QUEUE_PROPERTIES: 0x2 VENDOR_ID: 4098 MAX_COMPUTE_UNITS: 32 MAX_WORK_ITEM_DIMENSIONS: 3 MAX_WORK_GROUP_SIZE: 256 PREFERRED_VECTOR_WIDTH_CHAR: 4 PREFERRED_VECTOR_WIDTH_SHORT: 2 PREFERRED_VECTOR_WIDTH_INT: 1 PREFERRED_VECTOR_WIDTH_LONG: 1 PREFERRED_VECTOR_WIDTH_FLOAT: 1 PREFERRED_VECTOR_WIDTH_DOUBLE: 1 MAX_CLOCK_FREQUENCY: 1020 ADDRESS_BITS: 32 MAX_MEM_ALLOC_SIZE: 1073741824 IMAGE_SUPPORT: 1 MAX_READ_IMAGE_ARGS: 128 MAX_WRITE_IMAGE_ARGS: 8 IMAGE2D_MAX_WIDTH: 16384 IMAGE2D_MAX_HEIGHT: 16384 IMAGE3D_MAX_WIDTH: 2048 IMAGE3D_MAX_HEIGHT: 2048 IMAGE3D_MAX_DEPTH: 2048 MAX_SAMPLERS: 16 MAX_PARAMETER_SIZE: 1024 MEM_BASE_ADDR_ALIGN: 2048 MIN_DATA_TYPE_ALIGN_SIZE: 128 GLOBAL_MEM_CACHELINE_SIZE: 64 GLOBAL_MEM_CACHE_SIZE: 16384 GLOBAL_MEM_SIZE: 2893021184 MAX_CONSTANT_BUFFER_SIZE: 65536 MAX_CONSTANT_ARGS: 8 LOCAL_MEM_SIZE: 32768 ERROR_CORRECTION_SUPPORT: 0 PROFILING_TIMER_RESOLUTION: 1 ENDIAN_LITTLE: 1 AVAILABLE: 1 COMPILER_AVAILABLE: 1 MAX_WORK_GROUP_SIZES: 256 256 256

The program is correct and produces the right results (it also runs in other CPU and GPU processors), but the performance is very bad. It seems to me that OpenCL is not using all available cores. The same code runs 50X faster in a Nvidia GTX 680 card.

The code is a little complex, so I'm posting just the host code, once the OpenCL code executes correctly.

err = 0; err = clSetKernelArg(kernel2, 0, sizeof(i0), &i0); err |= clSetKernelArg(kernel2, 1, sizeof(i1), &i1); err |= clSetKernelArg(kernel2, 2, sizeof(step), &step); err |= clSetKernelArg(kernel2, 3, sizeof(cutBlock), &cutBlock); err |= clSetKernelArg(kernel2, 4, sizeof(cl_mem), (void*) &op->d_blockResult); err |= clSetKernelArg(kernel2, 5, sizeof(cl_mem), (void*) &op->d_busH); err |= clSetKernelArg(kernel2, 6, sizeof(cl_mem), (void*) &op->d_extraH); err |= clSetKernelArg(kernel2, 7, sizeof(cl_mem), (void*) &op->d_busV_h); err |= clSetKernelArg(kernel2, 8, sizeof(cl_mem), (void*) &op->d_busV_e); err |= clSetKernelArg(kernel2, 9, sizeof(cl_mem), (void*) &op->d_busV_o); err |= clSetKernelArg(kernel2, 10, sizeof(cl_mem), (void*) &op->d_split_m); err |= clSetKernelArg(kernel2, 11, sizeof(cl_mem), (void*) &op->t_seq0); err |= clSetKernelArg(kernel2, 12, sizeof(cl_mem), (void*) &op->t_seq1); if (err != CL_SUCCESS) exit(0); global = blocks * threads; local = threads; err = clGetKernelWorkGroupInfo(kernel2, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(max_work_groups), &max_work_groups, NULL); if (err != CL_SUCCESS) exit(0); err = clEnqueueNDRangeKernel(commands, kernel2, 1, NULL, &global, &local, 0, NULL, &k_event); if (err != CL_SUCCESS) exit(0); err = clWaitForEvents(1,&k_event); err = clReleaseEvent(k_event);

In a typical execution, global=4096 and local=64, so the work size is greater than the number of GPU cores.

Is there any reason for OpenCL not to use all available cores? Could it be a driver bug (I'm using most recent stable Catalyst driver)?

Thanks in advance.

Marco

ekondis · ‎01-24-2015

4096 workitems in the NDRange for Tahiti GPU is a performance limiting factor by itself. This is because the least minimum of workitems per CU required is 4 wavefronts, 1 per SIMD unit. Therefore, you need at least 256 workitems per CU which gives 32x256=8192 workitems in total. In your case you just utilize the half of compute elements in your GPU. If you also take into account that you cannot hide any memory or pipeline latencies with such a workitem count you can realize that the performance is even worse than half of peak.

For a comparison between the minimum parallelism requirements between GPU architectures have a peek at:

http://parallelplusplus.blogspot.gr/2014/10/least-required-gpu-parallelism-for.html

View solution in original post

ekondis · ‎01-24-2015

4096 workitems in the NDRange for Tahiti GPU is a performance limiting factor by itself. This is because the least minimum of workitems per CU required is 4 wavefronts, 1 per SIMD unit. Therefore, you need at least 256 workitems per CU which gives 32x256=8192 workitems in total. In your case you just utilize the half of compute elements in your GPU. If you also take into account that you cannot hide any memory or pipeline latencies with such a workitem count you can realize that the performance is even worse than half of peak.

For a comparison between the minimum parallelism requirements between GPU architectures have a peek at:

http://parallelplusplus.blogspot.gr/2014/10/least-required-gpu-parallelism-for.html

yurtesen · ‎01-24-2015

In my opinion global size of 4096 is very small to run efficiently. I usually had bad results for less than 50k global sizes. In addition I had register spills when I used complex codes with the AMDs OpenCL drivers. Are you using LDS? Did you try to use CodeXL to see more details about exactly what is taking the time? (also I am not sure if I understand why you think it is not using all the cores?)

realhet · ‎01-24-2015

Hi,

4096 globals with 64 local is only enough for the 50% for your hardware.

Also unless your kernel runs for like 100millisecs or longer, then the queueing of kernels could be an issue.

Try send the card as many workitems as you can in one kernel launch and make sure the work is longer than 100ms.

et1 · ‎01-25-2015

I'd suggest that you profile your code using CodeXL. It shows you what limits your kernel (registers, local memory, etc.), which I'm sure would help.

Ziple · ‎01-25-2015

Have you tried to make local=128 and global=2048 for example? If I remember correctly, wavefronts on AMD GPUs handle 64 threads at the time (while on NVidia GPUs it is 32), so may be you are not benefiting from latency-hiding.

(I don't know if it can really make a difference though, may be the hardware is actually able to mix different work-item execution on the same compute cores, but if it is not the case then you will see some improvements).

natasha · ‎01-26-2015

Try check this with CodeXL application. It provides summary table of features of kernel run. Probably, that information will help you.

marcoacf · ‎01-27-2015

Guys,

Thanks a lot for your responses - they are driving me to the solution!

It really seems the work size is smaller than the GPU capacity, that's why the cores performance is too bad. I'll change the code and perform some tests.

Best Regards

Archives Discussions

OpenCl program not using all available cores in a AMD 280X GPU