
Can anyone improve on my OpenCL kernel performance?

Question asked by sourcery on May 8, 2014
Latest reply on May 13, 2014 by sourcery

I am a relative newbie at GPU programming, so I may be making a serious error here.
I am running on an HD 6850 GPU and an AMD 8150 8-core CPU at 3.6 GHz.

I include a stripped-down kernel.
It works, in that the GPU kernel produces exactly the same answers as a CPU routine doing the same thing serially.
The kernel gets called around 78000 times. However, it seems to be slower than the serial CPU code.

The program spends approximately 25 seconds when the GPU does the work, versus 3.5 seconds when a single CPU core does it.
I understand that kernel launches are expensive, so I also timed the same GPU kernel changed to simply return on entry. That took 14 seconds, which leaves 11 seconds of real GPU time.
So even if we ignore the kernel launch overhead, the GPU still takes about 3 times as long as the CPU.
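
To make the arithmetic above explicit, here is a small C sketch that splits the measured time into launch overhead and compute. The function names are made up for illustration and are not part of my program, and since 78000 is only an approximate launch count the per-launch figure is approximate too.

```c
#include <assert.h>

/* Hypothetical helpers for the timing breakdown; names are illustrative only. */

/* Time left once the empty-kernel (launch overhead) time is subtracted. */
double compute_seconds(double total_s, double empty_kernel_s)
{
    assert(total_s >= empty_kernel_s);
    return total_s - empty_kernel_s;
}

/* Average overhead per kernel launch, in microseconds. */
double overhead_per_launch_us(double empty_kernel_s, long launches)
{
    assert(launches > 0);
    return empty_kernel_s * 1e6 / (double)launches;
}

/* How many times slower the GPU compute time is than the CPU run. */
double slowdown(double gpu_s, double cpu_s)
{
    assert(cpu_s > 0.0);
    return gpu_s / cpu_s;
}
```

With the numbers quoted above, compute_seconds(25.0, 14.0) gives 11 seconds, overhead_per_launch_us(14.0, 78000) comes to roughly 180 microseconds per launch, and slowdown(11.0, 3.5) is a little over 3.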


Compilation/build options are left as NULL.
I have played around with the number of work-groups in use, with very little if any improvement.
I also tried a parallel reduction in the kernel. It had very little effect on speed, so I removed it from the kernel for simplicity here.

One thing that might help, which I have not yet tried, is to reduce the global size in dimension 0 and make each work-item process multiple list items.
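
A rough sketch of what I mean, written in plain C rather than OpenCL C so it can be run and checked on the CPU. Everything here is an assumption for illustration: the function names are hypothetical, the scores array stands in for the real score calculation, and in an actual kernel gid and global_size would come from get_global_id(0) and get_global_size(0).

```c
#include <assert.h>
#include <stddef.h>

/* One emulated work-item: handles indices gid, gid + global_size, ...
   so a smaller global size still covers the whole list. */
void work_item_strided(size_t gid, size_t global_size, size_t listlen,
                       const int *scores, int minscore,
                       unsigned short *results)
{
    for (size_t i = gid; i < listlen; i += global_size) {
        int score = scores[i];      /* stand-in for the real score calculation */
        if (score > minscore)
            score = minscore;       /* same clamp as in the kernel */
        results[i] = (unsigned short)score;
    }
}

/* Serial emulation of launching global_size work-items. */
void launch_emulated(size_t global_size, size_t listlen,
                     const int *scores, int minscore,
                     unsigned short *results)
{
    assert(global_size > 0);
    for (size_t gid = 0; gid < global_size; ++gid)
        work_item_strided(gid, global_size, listlen, scores, minscore, results);
}
```

With this shape the bounds check against listlen moves into the loop condition, so the global size no longer has to cover every list item. Whether it actually helps on the GPU is something I would have to measure.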

I guess I must be doing something wrong, or my GPU is broken. GPUs are supposed to be fast!


Kernel follows:


__kernel void P1Kernelz(
   global const GPUMETRICS * restrict gpumetrics,  // Tables, randomish reads
   global const XSMINT     * restrict gpuxsm,      // Read once per call
   global       ushort     *          gpuresults,  // Write once per call
   global const GPUCALCX   *          gpucalcx)    // Data for item to find minimum for
{
__global const XSMINT *pgpuxsm;

int  score;
int  minscore = gpucalcx->bestdistinsu;
ulong al, ah;
ulong bl, bh;
ulong ab;
uint glindex;

int    txbxknown = gpucalcx->txbxknown;
ushort maxrsbdec = gpucalcx->maxrsbdec;
ushort gflagsf   = gpucalcx->gflagsf;
short  bx        = gpucalcx->bx;
short  tx        = gpucalcx->tx;
short  hx        = gpucalcx->hx;
short  wx        = gpucalcx->wx;
short  ax        = gpucalcx->ax;

global const ulong  *va    = (global const ulong *) &gpucalcx->glyph[0];
global const ulong  *vb;
global const ushort *nbtab = &gpumetrics->nb08tabp1x[0];

//uint  locid  = get_local_id(0);
uint  glbid  = get_global_id(0);

//uint  maxlocid = get_local_size(0);
uint   listlen = gpucalcx->listlen;


  // Global size has been padded to a multiple of 256 to allow the maximum local size, hence the need for the listlen check
  if (glbid < listlen) {

    // The calculation of score has been removed here.
    // It is a complex calculation involving 140 statements, 7 table lookups to different global const memory tables
    // and 64 table lookups to a single global const memory table.
    // Table dimensions are roughly 256 items per table.
    if (score > minscore) score = minscore;
    gpuresults[glbid] = score;