7 Replies Latest reply on Jan 27, 2015 6:31 AM by marcoacf

    OpenCl program not using all available cores in a AMD 280X GPU

    marcoacf

      Hi, guys.

      I'm developing a sequence comparison application on OpenCL 1.2, and testing in a AMD R9 280X GPU. Here is the video card information:

       

      DRIVER_VERSION: 1445.5 (VM)
      Type: GPU
      EXECUTION_CAPABILITIES: Kernel
      GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
      CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
      SINGLE_FP_CONFIG: 0xbe
      QUEUE_PROPERTIES: 0x2
      VENDOR_ID: 4098
      MAX_COMPUTE_UNITS: 32
      MAX_WORK_ITEM_DIMENSIONS: 3
      MAX_WORK_GROUP_SIZE: 256
      PREFERRED_VECTOR_WIDTH_CHAR: 4
      PREFERRED_VECTOR_WIDTH_SHORT: 2
      PREFERRED_VECTOR_WIDTH_INT: 1
      PREFERRED_VECTOR_WIDTH_LONG: 1
      PREFERRED_VECTOR_WIDTH_FLOAT: 1
      PREFERRED_VECTOR_WIDTH_DOUBLE: 1
      MAX_CLOCK_FREQUENCY: 1020
      ADDRESS_BITS: 32
      MAX_MEM_ALLOC_SIZE: 1073741824
      IMAGE_SUPPORT: 1
      MAX_READ_IMAGE_ARGS: 128
      MAX_WRITE_IMAGE_ARGS: 8
      IMAGE2D_MAX_WIDTH: 16384
      IMAGE2D_MAX_HEIGHT: 16384
      IMAGE3D_MAX_WIDTH: 2048
      IMAGE3D_MAX_HEIGHT: 2048
      IMAGE3D_MAX_DEPTH: 2048
      MAX_SAMPLERS: 16
      MAX_PARAMETER_SIZE: 1024
      MEM_BASE_ADDR_ALIGN: 2048
      MIN_DATA_TYPE_ALIGN_SIZE: 128
      GLOBAL_MEM_CACHELINE_SIZE: 64
      GLOBAL_MEM_CACHE_SIZE: 16384
      GLOBAL_MEM_SIZE: 2893021184
      MAX_CONSTANT_BUFFER_SIZE: 65536
      MAX_CONSTANT_ARGS: 8
      LOCAL_MEM_SIZE: 32768
      ERROR_CORRECTION_SUPPORT: 0
      PROFILING_TIMER_RESOLUTION: 1
      ENDIAN_LITTLE: 1
      AVAILABLE: 1
      COMPILER_AVAILABLE: 1
      MAX_WORK_GROUP_SIZES: 256 256 256

       

      The program is correct and produces the right results (it also runs in other CPU and GPU processors), but the performance is very bad. It seems to me that OpenCL is not using all available cores. The same code runs 50X faster in a Nvidia GTX 680 card.

      The code is a little complex, so I'm posting just the host code, once the OpenCL code executes correctly.

       

        err = 0;
        err = clSetKernelArg(kernel2, 0, sizeof(i0), &i0);
        err |= clSetKernelArg(kernel2, 1, sizeof(i1), &i1);
        err |= clSetKernelArg(kernel2, 2, sizeof(step), &step);
        err |= clSetKernelArg(kernel2, 3, sizeof(cutBlock), &cutBlock);
        err |= clSetKernelArg(kernel2, 4, sizeof(cl_mem), (void*) &op->d_blockResult);
        err |= clSetKernelArg(kernel2, 5, sizeof(cl_mem), (void*) &op->d_busH);
        err |= clSetKernelArg(kernel2, 6, sizeof(cl_mem), (void*) &op->d_extraH);
        err |= clSetKernelArg(kernel2, 7, sizeof(cl_mem), (void*) &op->d_busV_h);
        err |= clSetKernelArg(kernel2, 8, sizeof(cl_mem), (void*) &op->d_busV_e);
        err |= clSetKernelArg(kernel2, 9, sizeof(cl_mem), (void*) &op->d_busV_o);
        err |= clSetKernelArg(kernel2, 10, sizeof(cl_mem), (void*) &op->d_split_m);
        err |= clSetKernelArg(kernel2, 11, sizeof(cl_mem), (void*) &op->t_seq0);
        err |= clSetKernelArg(kernel2, 12, sizeof(cl_mem), (void*) &op->t_seq1);

        if (err != CL_SUCCESS)
        exit(0);

        global = blocks * threads;
        local = threads;

        err = clGetKernelWorkGroupInfo(kernel2, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(max_work_groups), &max_work_groups, NULL);
        if (err != CL_SUCCESS)
        exit(0);

        err = clEnqueueNDRangeKernel(commands, kernel2, 1, NULL, &global, &local, 0, NULL, &k_event);
        if (err != CL_SUCCESS)
        exit(0);

        err = clWaitForEvents(1,&k_event);
        err = clReleaseEvent(k_event);

       

      In a typical execution, global=4096 and local=64, so the work size is greater than the number of GPU cores.

       

      Is there any reason for OpenCL not to use all available cores? Could it be a driver bug (I'm using most recent stable Catalyst driver)?

       

      Thanks in advance.

       

      Marco