
qqchose
Adept I

Questions about performance

Hi

I have some questions about optimization in OpenCL. I have a kernel; what it does is not important for my questions, so take this small example:

kernel void myKernel(...)
{
    output = result;
}

My kernel needs 89.032 ms to complete. I need to call it many times, so I thought it would be faster to loop X times inside the kernel instead of calling the kernel X times. So I tried this:

kernel void myKernel(...)
{
    for (int i = 0; i < param.m_nbSample; ++i)
    {
        output += result;
    }
}

I set param.m_nbSample to 1 to make sure everything still works. The results are correct, but the kernel now needs 224.633 ms to complete, about 2.5x the original time. Not what I expected. I changed m_nbSample to see the effect:

If m_nbSample equals 2, each sample (loop iteration) needs 224.719 ms.
If m_nbSample equals 4, each sample needs 244.885 ms.
If m_nbSample equals 50, each sample needs 220.442 ms.

I tried hardcoding the loop count and using #pragma unroll:

kernel void myKernel(...)
{
    #pragma unroll 1
    for (int i = 0; i < 8; ++i)
    {
        output += result;
    }
}

Each sample (loop iteration) needs 243.236 ms.

Is it normal to have such a huge difference? It is a lot better to call my kernel X times than to loop X times inside it. What can explain this?

I tried something else: I moved the body of my kernel into a function, like this:

void myKernelFunction(...)
{
    output += result;
}

kernel void myKernel(...)
{
    myKernelFunction(...);
}

Now my kernel needs 93.7073 ms. I thought all functions were inlined into the kernel, so I don't understand why I already lose 4 ms just by adding a function. I tested it and the results are still correct. Now I can test what happens if I call myKernelFunction X times. I tried 2 calls:

kernel void myKernel(...)
{
    myKernelFunction(...);
    myKernelFunction(...);
}

The result: 972.904 ms per sample (so 1945.81 ms for the entire kernel). 10 times slower! Worse than before with the for loop. What can cause this? And what is supposed to be the best approach if I want to run my kernel X times?
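For reference, here is a minimal sketch of the batched variant being described (hypothetical argument names, not the real kernel); the alternative is simply enqueueing myKernel X times:

    // Sketch only: "input", "output" and "nbSample" are placeholders for the
    // real arguments, and the body stands in for the real per-sample work.
    kernel void myKernelBatched(global const float* input,
                                global float* output,
                                const int nbSample)
    {
        int index = get_global_id(0);
        for (int iSample = 0; iSample < nbSample; ++iSample)
        {
            float value = input[iSample * get_global_size(0) + index];
            output[index] += value;   // global memory is touched every iteration
        }
    }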

Thanks

11 Replies
realhet
Miniboss

Hi!

How many instructions are in your kernel? Are they mostly ALU instructions, or is there a lot of memory IO?

How many work-items do you launch per kernel each time?

On what hardware?

2x slower -> maybe the work-item count is not that good for the given hardware once you add an inner loop and decrease the work-item count.

10x slower ->

  - Did you accidentally force the compiler to run out of registers? (Scratch registers, i.e. memory, are terribly slow.)

  - Does the code fit into the instruction cache?

You can check both of these in the disassembly.
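One way to get at the disassembly is to dump the compiled binary and open it in the kernel analyzer; a rough sketch, assuming an already-built single-device cl_program called program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    /* "program" is assumed to be an already-built, single-device cl_program. */
    void dump_binary(cl_program program)
    {
        size_t bin_size = 0;
        clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                         sizeof(bin_size), &bin_size, NULL);

        unsigned char *bin     = (unsigned char *)malloc(bin_size);
        unsigned char *bins[1] = { bin };
        clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);

        FILE *f = fopen("kernel.bin", "wb");   /* inspect this file for        */
        fwrite(bin, 1, bin_size, f);           /* ScratchRegs and code size    */
        fclose(f);
        free(bin);
    }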

Thank you for this information.

My instructions are a mix of IO and ALU, maybe 50-50.

My global size is (1006, 806) and my local size is (6, 32).

My hardware is an AMD Radeon HD 6900 Series card with Catalyst 13.3.

I will try changing my local size to see if it makes a difference, thanks.

I don't know if I run out of registers, but in AMD APP KernelAnalyzer2 I see 559 ScratchRegs. Does that mean I run out of registers?

I don't know if my code fits in the instruction cache; my kernel needs 21442 lines of disassembly and the .bin is 1242 KB, so it probably doesn't.

I will check this, thanks.


- A 50-50 IO/ALU ratio is really bad; 1:30 would be OK (in dwords).

- Global size (1006, 806) is 810k work-items, which seems OK, but a local size of (6, 32) sounds a bit weird to me (maybe just to me; I always use 64, 128 or 256 for the local size, they're 100% safe). I hope the hardware can make something out of those 3 wavefronts. (See the sketch after this list.)

- 559 ScratchRegs -> it must be terribly slow. On a 69xx you should make your kernel use at most 128 registers, but this one uses 128+559 in total; you have to reorganize your kernel somehow. ScratchReg usage means really slow, contended RAM access instead of fast registers...

- 21442 lines of VLIW disassembly is 21442*8 ≈ 170 KB (without literal constants); the problem is that the GPU only has 48 KB of instruction cache. This adds to the ScratchReg slowness.

I can't imagine what your kernel can be, but it's way too big in many respects for a single stream processor to handle. Try to split the job to fit within the limits above. It's not a problem to have 100 million work-items, but stay under 128 dword registers and 40 KB of code, and watch the ALU/MEM instruction ratio.
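To illustrate the local-size point, a rough host-side sketch (hypothetical names) of launching with a 256 work-item local size and a global size padded up to a multiple of it:

    #include <CL/cl.h>

    /* Hypothetical helper: 16*16 = 256 work-items per group; the kernel is
     * assumed to guard against the padding, e.g.
     *     if (get_global_id(0) >= 1006 || get_global_id(1) >= 806) return;  */
    cl_int launch_padded(cl_command_queue queue, cl_kernel kernel)
    {
        size_t local[2]  = { 16, 16 };
        size_t global[2] = {
            ((1006 + local[0] - 1) / local[0]) * local[0],  /* 1006 -> 1008 */
            (( 806 + local[1] - 1) / local[1]) * local[1]   /*  806 ->  816 */
        };
        return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                      global, local, 0, NULL, NULL);
    }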


I asked a colleague why we use this local size. He answered that those values were optimized for NVIDIA. I did not check whether he is right, but on ATI those values are the worst; I checked with CodeXL and it's hard to find a worse case. I changed the local size to 256 and it's a lot better.

I found that my VGPRs are the bottleneck. Thanks, I will try to optimize them and I will try to split this kernel into smaller kernels.

I'm not allowed to say what is inside the kernel, but thanks, I now understand clearly what is happening and I can optimize our kernel.


One more thing: I noticed your execution times look a bit random.

You can get more precise execution times if you first warm up the GPU with a little dummy kernel and then immediately launch your real kernel.

For example, launch your kernel twice and only take the second timing!

That can be OK when the whole process completes in 0.5 s.

And when you are optimizing only for the last 1-2%, you will need 10-30 second long tests to get more precise timings.
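A rough host-side sketch of that measurement pattern (hypothetical helper; the queue must have been created with CL_QUEUE_PROFILING_ENABLE):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Warm up with one launch, then time only the second one via events. */
    void time_kernel(cl_command_queue queue, cl_kernel kernel,
                     const size_t *global, const size_t *local)
    {
        cl_event evt;

        /* Warm-up launch: lets the GPU raise its clocks; result is ignored. */
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
        clFinish(queue);

        /* Timed launch. */
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
        clWaitForEvents(1, &evt);

        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
        clReleaseEvent(evt);
    }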

Thanks,

Right now, I run my kernel 50 times and take an average, but I noticed the first iterations are always longer. I didn't know why. Now I will ignore some iterations when computing the average, thanks.


...50 times. Oh that's perfect then.

"I didn't know why."

When the card does nothing serious for a short period of time, it will reduce its clock speed down to 150-300 MHz to save power.

When you give it enough work to do, it will raise the clock speed automatically.

This behaviour is adjusted only a few times per second, so the first kernel launch will always be laggy.

(On the HD6970 and HD6990 there is a "Power Control Setting" slider in Catalyst Control Center/Overclk. If you overclock the card, this could also modify the clock rate when it thinks the card uses too much power.)

dmeiser
Elite

You might also consider accumulating your results in registers, i.e. introduce a __private variable resultP that you increment each time through your loop, and then write the result back to global memory once you're done with the loop. Right now it seems to me that you're reading from and writing to global memory each time through your loop.
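A minimal sketch of that idea (hypothetical names, not the poster's kernel): keep the running sum in a private variable and write it to global memory once, after the loop.

    kernel void myKernelAccum(global const float* input,
                              global float* output,
                              const int nbSample)
    {
        int index = get_global_id(0);
        float resultP = 0.0f;                    // private: stays in a register
        for (int iSample = 0; iSample < nbSample; ++iSample)
            resultP += input[iSample * get_global_size(0) + index];
        output[index] = resultP;                 // single global write
    }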


Thanks, this code was just an example; I wanted to keep it minimal to save you reading time. But thanks, I will check to be sure we do this.

himanshu_gautam
Grandmaster

qqchose,

Since all threads are reading and writing output, and since "i" is not a function of get_global_id(0), all threads are reading and writing the same variable...

This is bad on the GPU, buddy...

Cache thrashing, inconsistent results, etc...

That is probably why you are seeing performance numbers that you don't want to see.
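A tiny illustration of the concern (the poster clarifies below that the real code does index output by the global id):

    kernel void writeOutput(global float* output, global const float* result)
    {
        /* Bad: every work-item hits the same element -> races and cache thrashing:
         *     output[0] += result[0];
         * OK: each work-item owns its own element: */
        int gid = get_global_id(0);
        output[gid] = result[gid];
    }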


I did not write it out, but

     i = get_global_id(0) [...];

what happens to "i" is included in the part I omitted. But thanks, I will check to make sure nobody makes this kind of mistake. It seems obvious, but we are not invulnerable.

--EDIT--

I understand the confusion now. There is a mistake in my example, but not in my real code. I wrote

     output = value;

"i" is something like

     i = get_global_id(0)

Afterwards I added

     for(int i ...)

to my example, but that was a mistake because "i" is already used. In my real code, it looks like this:

     [CODE INIT]
     int index = get_global_id(0) [...]

     for (int iSample = 0; iSample < [...]; ++iSample)
     {
          [CODE LOOP1]
          value = [...]
          [CODE LOOP2]
          output[index] = value
     }

Sorry for the confusion.
