OpenCL

vit · ‎07-10-2018

(Sorry for my English)

There is a statistical function:

int tst(global int* x, int bar,
        int lo, int hi,
        int n, bool flag)
{
    int _bar = bar - n + 1,
        S1 = 0,
        NPlus1 = n + 1,
        i = 2,
        i_ = n >> 1 << 1;

    do
    {
        if (x[_bar] > x[bar])
            S1 += NPlus1 - i;
        if (i == i_)
            break;
        bar--;
        _bar++;
        i += 2;
    }
    while (true);

    /*process S1...*/
    return 0;
}

Now I remove NPlus1 and modify the function:

1) i = 1 [old: i = 2]

2) i_ = (n >> 1 << 1) - 1 [old: i_ = n >> 1 << 1]

3) S1 += n - i [old: S1 += NPlus1 - i]

The modified version runs 15-20% slower, although nothing has fundamentally changed.

What is the reason?

Radeon R7 360

Driver ver. 23.20.15033.5003
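
One way to confirm that the two variants really do compute the same S1 is to run both loops on the host in plain C. The sketch below is only an illustration under assumptions: the names s1_original and s1_modified are hypothetical, the OpenCL `global` qualifier and the unused parameters (lo, hi, flag) are dropped, and each function returns S1 instead of 0 so that the results can be compared. Both loops assume n >= 2 and bar >= n - 1, otherwise they do not terminate or read out of bounds.

```c
/* Host-side port of the original loop (hypothetical name).
   Returns S1 so that it can be compared with the modified variant. */
static int s1_original(const int *x, int bar, int n)
{
    int _bar = bar - n + 1,
        S1 = 0,
        NPlus1 = n + 1,
        i = 2,
        i_ = n >> 1 << 1;   /* n rounded down to an even number */

    do {
        if (x[_bar] > x[bar])
            S1 += NPlus1 - i;
        if (i == i_)
            break;
        bar--;
        _bar++;
        i += 2;
    } while (1);

    return S1;
}

/* Same loop with the three changes from the post applied:
   i starts at 1, i_ is one smaller, and n replaces NPlus1. */
static int s1_modified(const int *x, int bar, int n)
{
    int _bar = bar - n + 1,
        S1 = 0,
        i = 1,
        i_ = (n >> 1 << 1) - 1;

    do {
        if (x[_bar] > x[bar])
            S1 += n - i;    /* n - i here equals NPlus1 - (i + 1) above */
        if (i == i_)
            break;
        bar--;
        _bar++;
        i += 2;
    } while (1);

    return S1;
}
```

Since i in the modified loop is always one less than i in the original, every added term and the exit condition match, so any speed difference has to come from the generated code or the runtime environment rather than from the arithmetic.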

elstaci · ‎07-10-2018

Since this is an OpenCL question, try posting it in the OpenCL forum. (Deepak)

dipak · ‎07-10-2018

Hi Vit,

I've whitelisted you. I'm moving this thread to the OpenCL forum.

Regards,

dipak · ‎07-11-2018

The modified version runs 15-20% slower, although nothing has fundamentally changed. What is the reason?

It seems the modified code generates a different set of instructions, which affects performance. Please analyze the kernel with CodeXL and compare the ISA or IL code of the two versions. That should help you identify the reason for the difference.

vit · ‎07-11-2018

At the moment I do not use CodeXL. I modified the function gradually from the old version to the new one and found that the performance drop occurs when the expression 'NPlus1 = n + 1' is removed, either explicitly by me or by the optimizer once the expression is no longer used.

So the reason for the performance drop seems to be that a local variable stored on the stack or in a register was removed...

If I analyze the ISA or IL code, will I be able to fix the situation using only C language constructs?

dipak · ‎07-13-2018

Generally, you should not expect a one-to-one mapping between the source code and the ISA/IL code, especially when optimization is enabled. Analyzing the low-level code requires some expertise and an understanding of the instruction set. However, you can at least see the set of instructions generated for an OpenCL kernel/function.

What I meant earlier is that if you compare the two IL/ISA listings and the other resource counters, you may get an idea of where the performance difference comes from. A small change in the source code can produce a significant difference in the ISA. So, in addition to resource usage such as registers, it may also affect performance.

vit · ‎07-13-2018

I understand this. From the point of view of an ordinary C programmer, this situation is abnormal: a small, insignificant change in the program causes a huge drop in performance.
In a programming language compiler, this is unacceptable.

I guess (perhaps wrongly, but the facts suggest it) that the AMD OpenCL C compiler has serious problems with optimizing / placing local variables.

Even if I find the reason for the performance drop in the assembler code, I can hardly compensate for it using only the C language.

That would almost certainly require assembler / machine code, and I do not want to write the program in assembler / machine code.

In addition, I found 3-4 other oddities. One of them:

int open (...)
{
    int open = ... // the local variable has the same name as the function
}
With -O0, a build error occurs (at the link stage!) and my process (.exe) crashes; with optimization flags, everything works fine.

Thanks for the answers!

dipak · ‎07-16-2018

From the point of view of an ordinary C programmer, this situation is abnormal: a small, insignificant change in the program causes a huge drop in performance.
In a programming language compiler, this is unacceptable.

GPU programming is quite different from general CPU programming. For example, an optimization trick commonly used in CPU programming may hurt performance on GPUs.

With -O0, a build error occurs (at the link stage!) and my process (.exe) crashes; with optimization flags, everything works fine.

I suspect that, when optimization is enabled, the function call was replaced by inline code and no separate code was generated for the function, so there was no link problem. Without optimization, it is a normal function call, which caused the build error.
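
For what it is worth, the naming pattern itself is legal in standard C: the function name lives at file scope, and a local variable declared inside the body simply shadows it for the rest of that block. The sketch below shows this in plain host C; the name probe is hypothetical (the thread's function is called open, which would clash with the POSIX function on the host), and whether the OpenCL C compiler at -O0 handles the same pattern correctly is exactly what the thread leaves open.

```c
/* Hypothetical stand-in for the thread's open(): a function whose
   local variable has the same name as the function itself. */
static int probe(int x)
{
    int probe = x * 2;  /* shadows the function name inside this body */
    return probe + 1;   /* refers to the local variable, not the function */
}
```

If the -O0 build error goes away after renaming the local variable, that would point to name handling in the compiler or linker rather than to the kernel logic.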

vit · ‎07-17-2018

I found the reason for the ±20% performance change. I was using too much memory relative to the value returned by clGetDeviceInfo with the CL_DEVICE_MAX_MEM_ALLOC_SIZE flag. I greatly reduced memory use, and now adding or removing one local variable no longer causes such swings: the program's performance behaves as expected.
I'm sorry!
But some things are still confusing.

As I understand it, this function returns the amount of video card memory that can be allocated by a single call to clCreateBuffer, but some of that memory is still used for system needs (possibly stacks, etc.).
How can I find out the maximum amount of memory that can be allocated for program data so that it does not conflict with memory for system needs, and so that host (computer) memory is not accessed when the video card runs out of memory?
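
A possible starting point, offered as a sketch rather than an answer from AMD: CL_DEVICE_MAX_MEM_ALLOC_SIZE is the limit for a single memory object, while CL_DEVICE_GLOBAL_MEM_SIZE reports the device's total global memory, and as far as I know core OpenCL has no query for how much of that is currently free. A conservative approach is therefore to keep the total of all buffers below the global size minus a reserve. The helper below is hypothetical (plan_budget and mem_budget are not OpenCL names), and the reserve percentage is an arbitrary assumption that has to be tuned by measurement.

```c
#include <stdint.h>

/* Hypothetical result type: limits to respect when creating buffers. */
typedef struct {
    uint64_t per_buffer;  /* upper bound for any single clCreateBuffer call */
    uint64_t total;       /* upper bound for the sum of all buffers */
} mem_budget;

/* Hypothetical helper, not part of OpenCL.
   global_mem:      value queried with
                    clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE,
                                    sizeof(cl_ulong), &global_mem, NULL)
   max_alloc:       value queried the same way with
                    CL_DEVICE_MAX_MEM_ALLOC_SIZE
   reserve_percent: share of global memory left untouched for the driver
                    and system; an assumed tuning knob, not a spec value. */
static mem_budget plan_budget(uint64_t global_mem, uint64_t max_alloc,
                              unsigned reserve_percent)
{
    mem_budget b;

    if (reserve_percent > 100)
        reserve_percent = 100;

    /* Divide before multiplying so that large sizes cannot overflow. */
    b.total = global_mem - (global_mem / 100) * reserve_percent;

    /* A single buffer may exceed neither the per-object limit
       nor the total budget. */
    b.per_buffer = max_alloc < b.total ? max_alloc : b.total;

    return b;
}
```

With such a budget in hand, the program can sum the sizes of its buffers before calling clCreateBuffer and shrink the working set when the sum exceeds b.total, then verify with timing runs that performance stays stable.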

OpenCL: Does optimization slow?