
sourcery
Journeyman III

Can anyone improve on my OpenCL kernel performance?

I'm a relative newbie at GPU programming, so I may be making a serious error here.
I'm running on an HD 6850 with an AMD FX-8150 8-core CPU at 3.6 GHz.

I have included a stripped-down kernel below.
It works, in that the GPU kernel produces exactly the same answers as a CPU routine doing the same thing serially.
The kernel gets called around 78,000 times. However, it seems to be slower than the serial CPU code.

The program spends approximately 25 seconds using the GPU to do the work, versus 3.5 seconds for a single CPU core to do the same work.
Now, I understand that kernel launches are expensive, so I timed the same GPU kernel modified to simply return on entry; the time then was 14 seconds, leaving about 11 seconds of real GPU time.
So the GPU still takes 3 times as long as the CPU, even ignoring the kernel launch overhead.

Compilation/build options are left as NULL.
I have played around with the number of workgroups in use, with very little if any improvement.
I have used a parallel reduction in the kernel; it had very little effect on speed, so I removed it from the kernel for simplicity here.

One thing that might help, which I have not yet tried, is to reduce global_size(0) and make the kernel process multiple work items.

I guess I must be doing something wrong, or my GPU is broken; GPUs are supposed to be fast!

Kernel follows:

__kernel void P1Kernelz(
   global const GPUMETRICS * restrict gpumetrics, // Tables, random-ish reads
   global const XSMINT     * restrict gpuxsm,     // Read once per call
   global       ushort     *          gpuresults, // Write once per call
   global const GPUCALCX   *          gpucalcx    // Data for item to find minimum for
   )
{
global const XSMINT *pgpuxsm;

int  score = 0;   // placeholder; the real calculation has been stripped out below
int  minscore = gpucalcx->bestdistinsu;
ulong al, ah;
ulong bl, bh;
ulong ab;
uint glindex;

int    txbxknown = gpucalcx->txbxknown;
ushort maxrsbdec = gpucalcx->maxrsbdec;
ushort gflagsf   = gpucalcx->gflagsf;
short  bx = gpucalcx->bx;
short  tx = gpucalcx->tx;
short  hx = gpucalcx->hx;
short  wx = gpucalcx->wx;
short  ax = gpucalcx->ax;

global const ulong  *va = (global const ulong *) &gpucalcx->glyph[0];
global const ulong  *vb;
global const ushort *nbtab = &gpumetrics->nb08tabp1x[0];

//uint locid = get_local_id(0);
uint glbid = get_global_id(0);

//uint maxlocid = get_local_size(0);
uint listlen = gpucalcx->listlen;

  // Global size has been padded to a multiple of 256 to allow the maximum local size, hence the need for listlen
  if (glbid < listlen) {

  // The calculation of score has been removed here.
  // It is a complex calculation involving around 140 statements, 7 table lookups into different global const memory tables
  // and 64 table lookups into a single global const memory table.
  // Table dimensions are roughly 256 items per table.

  if (score > minscore) score = minscore;
  gpuresults[glbid] = score;
  }
return;
}

sourcery
Journeyman III

Sad to be replying to my own post, but as I find often happens, you send an email and then you realise what's wrong with it 😞

I had underestimated the kernel call + I/O overhead: I did not realise that the call count to the kernel would drop if I simply returned from the kernel on entry. Sorry, this is a real application that I am trying to improve, and things like that happen when the results are different.

With a parallel reduction on the GPU, the 76k calls to the kernel take around 23.88 secs. Keeping the call count the same, but with a simple return on kernel entry, the 76k calls take around 21.0 secs. So if we ignore the call overhead, the GPU time of 2.88 secs is faster than the 3.5 secs of single-core CPU processing time.

It's still a rubbish time, as in: why use a GPU? I'm not sure how to calculate the number of cores the GPU is using, but it must be several hundred. I would have expected a 10x improvement at least.

I am only slightly relaxed about the call overhead; there are 2 strategies I intend to employ if this project ever gets to lift off:

1. The Kaveri chip might be a lot better at this (a migration option for us).

2. If that fails, I ought to be able to make the kernel a lot bigger (as in do more). However, the outlined routine is the only one that I think will benefit from a parallel-processing GPU; the "bigger" comes from transferring a lot of serial-only CPU code to run serially on the GPU, and hoping that an occasional call to the GPU/APU (say 1100 calls as opposed to 78k) will make the overhead disappear.


Ok, but next time don't write your own benchmarking code. Just drop your program into CodeXL and do an Application Timeline Trace profile as a start.

You'll still have plenty to guess at (such as: how do I pack my values in memory for optimal random-ish indirected read performance?), but at least you'll have real GPU-side measurements to consider.

Hi maxdz8,

Well, I'm sure I said I was a newbie. I've been using CodeXL to get the compilation errors out of the kernel, and that's it.

Are you suggesting I can drop a whole Visual C project into CodeXL? That sounds unlikely; last time I ported some Visual C code over, I spent a few days finding a bug that was entirely due to differences in structure packing between Visual C and OpenCL.

If you meant drop the kernel into CodeXL, well OK, that's kind of done, but I would not trust any profiling without real data and lots of samples.

I am sure I have a lot to learn.

All the best,

Steve



sourcery, just a few comments:

  • If your table lookups can fit in constant memory, you can put them there and define the array with __constant. This is different from __global const memory and may improve performance (see the sketch after this list).
  • Since most of your code is missing it's hard to tell, but you seem to have a bandwidth-bound kernel call (7+64 reads, 1 write, O(??) math operations). You called this a reduction operation. If you're copying new data from the host to the discrete GPU for each kernel invocation, your bottleneck is the PCIe bus (~5 GB/s). Reduction operations on GPUs generally only make sense when the data is resident on the card. No matter how fast you reduce the data that you copied over, your effective wall-clock speed will be limited by the PCIe speed. It's important to understand this.
  • GPUs don't have several hundred cores. They have roughly single digits to dozens of Compute Units (CUs); Hawaii has 44 CUs. Each Compute Unit in AMD's GPU architecture has 64 Processing Elements. If you launch a kernel with 1024 threads and a local work size of 64, you get 1024/64 = 16 workgroups spread over the CUs. If your GPU doesn't have 16 CUs, the scheduler will issue multiple wavefronts per CU. Basically, you want as many threads as possible, on the order of tens of thousands.
  • You might try using persistent threads by moving your outer loop with 76k kernel invocations inside the kernel code itself.
  • Or you can make your global number of threads greater, to capture all 76k kernel calls in one kernel launch.
  • There are a lot of optimization strategies that I could discuss if I could see more code.
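For the first point, a minimal sketch of what the __constant form could look like, assuming the 256-entry table fits within the device's constant buffer limit (the real table values are elided; the zero initialiser is only a placeholder):

// A program-scope table in the __constant address space, instead of reading it
// through a __global const pointer or copying it into a private array inside the kernel.
// Placeholder initialiser: the real 256 values (the contents of nb08tabp1x) go here.
__constant ushort nb08tab_c[256] = { 0 };

// Inside the kernel the lookups then become, for example:
//   score += nb08tab_c[(uchar) bl];
// instead of:
//   score += *(nbtab + (uchar) bl);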


Thanks for the suggestions, Jross. Regarding your comments:

I tried putting the most used table (512 bytes) into constant memory (on the stack presumably), changing

global const ushort  *nbtab = &gpumetrics->nb08tabp1x[0];

to

const ushort nbtab[256] = {values};

It seemed to make things worse.

Only GPUCALCX and the results change for every kernel run, and the results are only 4 bytes long, so hopefully this is not a PCIe bus problem.

Persistent threads are something that might be possible with Kaveri, where the kernel can communicate with the CPU and wait for data from it. The problem I have (with a GPU) is that the result of one kernel call changes the data in GPUCALCX for the next one, in a complexish way. All the control code that calls the kernel could be ported to the GPU, but almost all of it needs to run serially.

I have difficulty understanding why OpenCL starts a thread for every item in global_size(); I would have thought it more efficient to start threads only for the number of cores and let the kernel iterate through the global IDs appropriately. That way, global memory tables could be efficiently copied to faster storage, since the kernel is in a loop. If every kernel thread only executes one item, there is nothing to be gained.

Anyway, here is the complete kernel (tidied up):

#define tablim     256
#define mapydim     16
#define bitstoproc     4
#define  maxlocal  256


typedef struct tagGPUMETRICS
{
#define  max_pscore    32000
float  unused_a[max_pscore+1];

ushort nb08tabp1x[256];    
ushort unused_b[256];     

int  isbtabfd1[33];      
int  igftabfd1[64];      
int  iaxtabfd1[(tablim<<1)+1];
int  ibxtabfd1[(tablim<<1)+1];
int  itxtabfd1[(tablim<<1)+1];
int  ihxtabfd1[(tablim<<1)+1];
int  iwxtabfd1[(tablim<<1)+1];

int  unused_c[33];     
int  unused_d[64];     
int  unused_e[(tablim<<1)+1];
int  unused_f[(tablim<<1)+1];
int  unused_g[(tablim<<1)+1];
int  unused_h[(tablim<<1)+1];
int  unused_i[(tablim<<1)+1];
} GPUMETRICS;

typedef struct tagXSMINT
{
ulong  glyph[mapydim>>1];
ushort  gflags;
ushort  padding;
short  bx;
short  tx;
short  hx;
short  wx;
short  ax;
ushort  maxrsbdec;
ushort  iweight;
ushort  unused_a;
uint    unused_b;
ushort  unused_c;
ushort  unused_d;
ushort  unused_e;
ushort  unused_f;
} XSMINT;

typedef struct tagGPUCALCX
{
ulong glyph[mapydim>>1];
ushort gflagsf;
ushort padding;
short bx;
short tx;
short hx;
short wx;
short ax;
ushort maxrsbdec;

int  unused_a;
int  unused_b;
int  unused_c;
int  unused_d;

float unused_e;
ushort unused_f;

int  mindistinsu;
int  listlen;
} GPUCALCX;

// Finds the nearest match for the gpucalcx item in the set of gpuxsm
kernel void P1Kernel (
   global const GPUMETRICS * restrict gpumetrics, // Tables, random-ish reads (unchanged between calls)
   global const XSMINT     * restrict gpuxsm,     // Read once per call, approx 3000 rows (unchanged between calls)
   global const GPUCALCX   *          gpucalcx,   // Data for item to find minimum for (different for each call)
   global       ushort     *          p1results   // Write once per call (only last 4 bytes read by CPU)
   )
{

global  const XSMINT *pgpuxsm;

int  score;
int  minscore = gpucalcx->mindistinsu;
ulong al, ah;
ulong bl, bh;
ulong ab;
uint glindex;

global const ulong *va = (global const ulong *)  &gpucalcx->glyph[0];
global const ulong  *vb;
 
#if bitstoproc == 4
global const ushort  *nbtab= &gpumetrics->nb08tabp1x[0];
#endif

local ushort  localminscore[maxlocal];
local ushort  localminindex[maxlocal];

uint  locid  = get_local_id(0);
uint  glbid  = get_global_id(0);
uint   listlen = gpucalcx->listlen;
uint  maxlocid = get_local_size(0);    // Code assumes this is a power of 2 and <= maxlocal

  if (glbid < listlen) {
  pgpuxsm = &gpuxsm[glbid];
 
  score  = gpumetrics->iaxtabfd1[gpucalcx->ax        - pgpuxsm->ax];
  score += gpumetrics->isbtabfd1[gpucalcx->maxrsbdec - pgpuxsm->maxrsbdec];
  score += gpumetrics->igftabfd1[gpucalcx->gflagsf   ^ pgpuxsm->gflags];
  score += gpumetrics->ibxtabfd1[gpucalcx->bx        - pgpuxsm->bx];
  score += gpumetrics->itxtabfd1[gpucalcx->tx        - pgpuxsm->tx];
  score += gpumetrics->ihxtabfd1[gpucalcx->hx        - pgpuxsm->hx];
  score += gpumetrics->iwxtabfd1[gpucalcx->wx        - pgpuxsm->wx];
  score >>= 4;      // Scores are expanded by 16 for greater metric resolution

  // Process 2 rows at a time of 8
  vb = (global const ulong *) &pgpuxsm->glyph[0];
  for (glindex = 0; glindex < (mapydim >> 1); glindex += 2) {

   bl = vb[glindex+0];
   bh = vb[glindex+1];
   al = va[glindex+0];
   ah = va[glindex+1];

   ab = ~((bl|bh)&(al|ah));
   al |= bl;   
   ah |= bh;  
   al &= ab;  
   ah &= ab;  

    
#if bitstoproc == 4
   // bl & bh could be broken up into 16-bit chunks, 8-bit chunks or 4-bit chunks;
   // it seems to work best on my GPU with 8-bit chunks
   #define nibbles1_4b   0x0F0F0F0F0F0F0F0F
   #define nibbles2_4b ~nibbles1_4b
   bl  = (al & nibbles1_4b) | ((ah & nibbles1_4b)<<4);
   bh  = (ah & nibbles2_4b) | ((al & nibbles2_4b)>>4);
  
   score += *(nbtab + (uchar) bl);       
   bl >>= 8;               
   score += *(nbtab + (uchar) bh);       
   bh >>= 8;               
   score += *(nbtab + (uchar) bl);       
   bl >>= 8;               
   score += *(nbtab + (uchar) bh);       
   bh >>= 8;               
   score += *(nbtab + (uchar) bl);       
   bl >>= 8;               
   score += *(nbtab + (uchar) bh);       
   bh >>= 8;               
   score += *(nbtab + (uchar) bl);       
   bl >>= 8;               
   score += *(nbtab + (uchar) bh);       
   bh >>= 8;               
   score += *(nbtab + (uchar) bl);       
   bl >>= 8;               
   score += *(nbtab + (uchar) bh);       
   bh >>= 8;               
   score += *(nbtab + (uchar) bl);       
   bl >>= 8;               
   score += *(nbtab + (uchar) bh);       
   bh >>= 8;               
   score += *(nbtab + (uchar) bl);       
   bl >>= 8;               
   score += *(nbtab + (uchar) bh);       
   bh >>= 8;               
   score += *(nbtab + bl);       
   score += *(nbtab + bh);
#endif  
  }
 
  if (score > minscore)
   score = minscore;
  localminscore[locid] = score;
  localminindex[locid] = glbid;
 } else {
  localminscore[locid] = gpucalcx->mindistinsu;
  localminindex[locid] = 65000;
 }

// Following based on http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-c...
// Find local memory minimum
barrier(CLK_LOCAL_MEM_FENCE);
int maxp = maxlocid >>1;
while(maxp > 0) {
  if (locid < maxp) {
   if (localminscore[locid] >  localminscore[locid+maxp] ||
     (localminscore[locid] == localminscore[locid+maxp] && localminindex[locid] > localminindex[locid+maxp])) {
    localminscore[locid] = localminscore[locid+maxp];
    localminindex[locid] = localminindex[locid+maxp];
   }
  }
  maxp = maxp >> 1;
  barrier(CLK_LOCAL_MEM_FENCE);
}

// Store results for local minimum to global memory
if (locid == 0) {
  int grpid = get_group_id(0);
   p1results[(grpid<<1)+0] = localminscore[0];
  p1results[(grpid<<1)+1] = localminindex[0];
}
  barrier(CLK_GLOBAL_MEM_FENCE);

// Find minimum of all local minimums (CPU would be faster at this but this is tidier)
  if (glbid == 0) {
  int numgroups = get_num_groups(0) << 1;
  glindex = 2;
  while (glindex < numgroups) {
   if (p1results[0] > p1results[glindex] ||
     (p1results[0] == p1results[glindex] && p1results[1] > p1results[glindex+1])) {
      p1results[0] = p1results[glindex];
      p1results[1] = p1results[glindex+1];
    }
    glindex += 2;
  }
}
return;
}


Hi,

With things like these:

   score += *(nbtab + (uchar) bl);
   bl >>= 8;
   score += *(nbtab + (uchar) bh);
   bh >>= 8;

you're killing the memory interface of the card. It is designed to read large aligned chunks of memory (2048 bits at a time), but your code issues lots of unaligned 8-bit reads.

But luckily the tables are very small, so my advice would be:

- At the start of the kernel, copy your tables into local memory (LDS); see the sketch after this list.

- Also, as I'm not sure right now that there are dedicated 16-bit LDS instructions, you can widen the table elements to 32 bits for easier access.

- Shifting a 64-bit value is slower than just masking out adjacent bits from 32-bit integers. There are AMD-exclusive extensions for this in OpenCL.

- Use 64 as the workgroup size, so the local barriers can be eliminated.

- Because the kernel now has to copy the tables into local memory at the start, which takes time, try to make your outer loop as long as possible.

- 78,000 kernel launches is very bad. Try to launch exactly no_of_streams_in_your_card*4 work items (or an integer multiple of this) only, and make long work queues inside them. This requires that every thread has the same execution time. For maximum ALU utilization you'll need long kernels that run for over 100 ms. Unfortunately your algorithm is so memory hungry that it can't get near maximum ALU utilization, but with LDS it should outperform the CPU implementation many times over.
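For the LDS point above, a minimal sketch of the cooperative copy at the top of the kernel (assuming the GPUMETRICS layout already posted; the table is widened to uint as suggested, and names ending in _lds are illustrative, not tested code):

kernel void P1Kernel_lds(
   global const GPUMETRICS * restrict gpumetrics,
   global const XSMINT     * restrict gpuxsm,
   global const GPUCALCX   *          gpucalcx,
   global       ushort     *          p1results)
{
  // One copy of the table per workgroup, widened to 32 bits.
  local uint nbtab_lds[256];

  uint locid   = get_local_id(0);
  uint localsz = get_local_size(0);

  // Cooperative copy: each work item loads a strided slice of the table.
  for (uint i = locid; i < 256; i += localsz)
    nbtab_lds[i] = gpumetrics->nb08tabp1x[i];
  barrier(CLK_LOCAL_MEM_FENCE);   // the table is now ready for the whole workgroup

  // ... the scoring loop then reads nbtab_lds[(uchar) bl] etc. instead of going
  //     through the global-memory pointer nbtab ...
}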

Good luck!


Hi realhet,

A lot of food for thought there, and a lot I don't yet understand.

You mention making the outer loop as long as possible; there is only one trivial loop in the kernel, which could be unrolled. Do you mean I should make one, so that a single kernel thread processes multiple work items? I can see how this would be useful: the use of local memory would give more benefit.

Not sure how to do this though; by default I get as many threads as work items, so there is no loop.

Copying, say, nbtab (the most used table) to local memory did not seem to help, nor did just defining the array nbtab with all its values inside the kernel. It may work a lot better if each thread processed multiple work items.

I did try to process multiple work items in a kernel, though I kept getting wrong results, and I'm not sure why. Do you know of any published method for doing it?

Cheers.



Hi,

"Do you mean I should make one so that a single kernel thread processes multiple work items ?"

You said earlier that "The kernel gets called around 78000 times"

Is it means that you use clEnqueueKernelNDRange() 78000 times? If so, that thousands of kernel launches has a very big penalty on GPU. (However no penalty on CPU because the data is already resides on the cpu)


Using local memory would help in this way:

Right now, with global memory, every work item uses the same memory controller concurrently. This is the bottleneck: you interleave almost every ALU operation with a memory operation. On a GPU the memory is slow compared to the ALUs; even at 320 GB/s on the best hardware, there are so many ALUs that the memory cannot keep up.

Let's say on a Tahiti device you have 8192 work items running, assigned evenly to the 32 Compute Units. Each Compute Unit can access its own 64 KB of local memory (LDS). So instead of 8192 threads all hitting a single memory interface (on every L1 cache miss), it is much better to first copy the data into LDS; then each group of 256 work items can use its own LDS unit at the same time.

And finally, because the copy operation takes a while, you should reuse that copied table, by making a loop around the code that actually uses the table and doing more tasks with the same table (see the sketch below).
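A minimal sketch of that shape: a stride loop over the rows, so the copied table is reused many times per work item (variable names are illustrative, and the scoring code and the existing reduction are elided):

kernel void P1Kernel_strided(
   global const GPUMETRICS * restrict gpumetrics,
   global const XSMINT     * restrict gpuxsm,
   global const GPUCALCX   *          gpucalcx,
   global       ushort     *          p1results)
{
  local uint nbtab_lds[256];
  uint locid = get_local_id(0);
  for (uint i = locid; i < 256; i += get_local_size(0))
    nbtab_lds[i] = gpumetrics->nb08tabp1x[i];
  barrier(CLK_LOCAL_MEM_FENCE);

  int  minscore  = gpucalcx->mindistinsu;
  uint listlen   = gpucalcx->listlen;
  uint stride    = get_global_size(0);   // launch far fewer work items than listlen

  int  bestscore = minscore;
  uint bestindex = 65000;

  // Each work item now scores listlen/stride rows instead of one,
  // so the LDS copy above is amortised over many rows.
  for (uint row = get_global_id(0); row < listlen; row += stride) {
    int score = 0;   // ... the same scoring code as the posted kernel, using nbtab_lds ...
    if (score < bestscore || (score == bestscore && row < bestindex)) {
      bestscore = score;
      bestindex = row;
    }
  }

  // bestscore/bestindex then feed the existing local-memory reduction and the
  // final write to p1results, exactly as in the posted kernel.
}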


Hi realhet,

Yes, clEnqueueNDRangeKernel() gets called 78,000 times for the set of files I am using to test/benchmark the kernel. The GPU is competing with a CPU that can execute the kernel equivalent once every 44 microseconds.

There are around 3000 work items in this case; there is another variant, which I have yet to write, where the kernel would be required to process roughly 7000 work items.

I am aware that this kernel is too small (it does not do enough work) to really benefit from a GPU. As mentioned earlier, there are a couple of strategies (using the new features of Kaveri and/or moving a lot of non-parallelisable code from the CPU to the GPU). Unfortunately it really is sequential-only code, so I am going to have to wrap that code up in an if (get_global_id(0) == 0) statement. The win might come from far fewer kernel calls, perhaps 1% of the current count.

Some sample timings for this are:

- Execute on CPU: CPU time 3.625 secs.
- A null kernel (simple return on entry): GPU time 20.98 secs.
- Best GPU time so far, using nbtab in global memory: 23.5 secs.
- Copying nbtab to local memory: worse than using global memory.
- Defining nbtab as an array of values predefined in the kernel: worse than using global memory.
- Making nbtab smaller, with an array of 16 values predefined in the kernel (almost 2x the number of statements though): worse than using global memory.
- Changing the workgroup size to 64: GPU time 27 secs.

Your suggestion of only having as many threads as there are stream processors sounds like a good one: process a few work items per thread, allowing more efficient use of local and private memory. I have not got this to work yet, however.

In fact nothing is working at the moment :-(. I upgraded my card to an R9 270X, and when installing Catalyst 14.4 I am told I must be a member of the administrator group, when I already am! There is also a possibility my old card is duff.

Regards

Steve



Sure, the 78K kernel launches are the first and biggest bottleneck. This problem must be solved first.

After that, you can select proper work item counts:

Work items per workgroup must be 64, 128, 192 or 256.

- If the workgroup size is 1, it will take the same time as when the workgroup size is 64 (in case you are at peak ALU utilization).

- Unless you have special needs, I strongly suggest 64.

The R9 270X has 1280 stream units. Thus the minimum number of work items required to reach maximum ALU performance is 1280*4 = 5120. Scalability is similar here: 5121 work items will take twice as much time as 5120 work items (in case you are at peak ALU utilization).

But first, the 78K kernel launches must be solved: they take 20 secs out of the 23 secs. So I'd try to collect the input data in those test files you have and process it in batch, with one or a few long kernel launches (a sketch of that shape is below). That way it could probably get down to 3 secs (unless some other bottleneck arises).
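Purely as an illustration of the batching idea (not your actual code), one possible shape is to upload a whole array of GPUCALCX records and use a second NDRange dimension to pick the record; this only works if the records in a batch do not depend on each other's results:

kernel void P1Kernel_batch(
   global const GPUMETRICS * restrict gpumetrics,
   global const XSMINT     * restrict gpuxsm,
   global const GPUCALCX   * restrict gpucalcxbatch,   // one entry per original launch
   global       ushort     *          p1results)       // room for one result pair per workgroup per entry
{
  uint item  = get_global_id(0);   // which gpuxsm row, as before
  uint batch = get_global_id(1);   // which GPUCALCX record in this batch

  global const GPUCALCX *gpucalcx = &gpucalcxbatch[batch];
  global       ushort   *results  = &p1results[batch * 2 * get_num_groups(0)];

  // ... the existing per-item scoring and per-workgroup reduction go here,
  //     reading from gpucalcx and writing the workgroup minimum into results ...
}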


Hi,

Got the new video card working finally. The results are not materially different.

Agreed, the 78k kernel call overhead is the biggest problem. Unfortunately I cannot batch-process the calls without porting thousands of lines of code into the kernel: the result of one call changes the data (my GPUCALCX structure) for the next call.

I don't know whether you are aware, but this is from the HSA development beta:

"Platform Atomics provides memory consistency for loads and stores in the host program and the compute kernel.   The host and device can atomically operate on the same memory locations, and also have control over memory visibility.   Platform atomics provide new instruction-level synchronization between the CPU and the HSA CUs– in addition to the coarse-grained “command-level” synchronization that OpenCL™ has traditionally provided.

For example, platform atomics enable the device and host to participate in “producer-consumer” style algorithms, without requiring the launch of a new kernel.  The kernel running on the device can produce a piece of data, use the platform atomic operations to set a flag or enqueue the data to a platform scope, and the host can see the produced data – all while the compute kernel continues to run.   Likewise, the CPU can act as the producer, sending work to a running kernel which is waiting for that data to arrive.  Essentially platform atomics allow communication between host and device without having to relaunch new kernels from the host – this can be significantly faster, and also can result in a more natural coding style.   Platform Atomics also helps in writing lock/wait-free data structures that can scale across the platform."

This sounds ideal for this problem: essentially, enqueue the kernel once only and use atomics to synchronise providing new GPUCALCX data to the kernel and collecting the results (roughly the shape sketched below).
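Purely as a rough sketch of that producer-consumer shape (this assumes OpenCL 2.0 fine-grained SVM with platform atomics, or the equivalent HSA support quoted above; all names are illustrative, it is not a tested implementation, and it glosses over making sure every workgroup has finished before the result is signalled):

kernel void P1Persistent(
   volatile global atomic_int *jobnum,                 // host increments this to publish a new GPUCALCX, or stores -1 to quit
   volatile global atomic_int *donenum,                // kernel stores the job number it has finished
   global const GPUMETRICS * restrict gpumetrics,
   global const XSMINT     * restrict gpuxsm,
   global       GPUCALCX   *          gpucalcx,        // host rewrites this between jobs
   global       ushort     *          p1results)
{
  int lastdone = 0;
  for (;;) {
    int job = atomic_load_explicit(jobnum, memory_order_acquire,
                                   memory_scope_all_svm_devices);
    if (job == -1) return;          // host asked us to stop
    if (job == lastdone) continue;  // no new work yet, keep polling

    // ... the existing scoring + reduction for the current gpucalcx ...

    lastdone = job;
    if (get_global_id(0) == 0)
      atomic_store_explicit(donenum, job, memory_order_release,
                            memory_scope_all_svm_devices);   // host polls donenum for the result
  }
}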

Now that I understand I am not making stupid errors with the kernel (other than it being too small), I can move on to trying out the new Kaveri features and find out whether they can give me the required performance. This will take a while, though. If it's of interest to anyone, I will report back when I have some Kaveri results.

Thanks to all who have offered suggestions. I am a little wiser.

Regards

Steve
