cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

empol
Journeyman III

Oil painting kernel on CPU - rubish output

I wrote kernel for oil painting filter. Works for GPU, but for CPU i get messed up output. I can see shapes and colors, but it's very noisy.

 

 

__kernel void oil(__global uchar4 * in, __global uchar4 * out, const int brushsize) { const int radius = brushsize >> 1; int sizeX = get_global_size(0); int sizeY = get_global_size(1); int x = get_global_id(0); int y = get_global_id(1); uchar intensity, maxIntensity = 0, maks = 0; uchar4 intensities[256]; uint4 arrays[256]; float4 params = (float4) (0.2125f, 0.7154f, 0.0721f, 0.0f), temp; int t; uchar4 p; uint Index = y * sizeX + x; intensities[0].x = '\0'; arrays[0] = '\0'; for (int i = -radius; i <= radius; i++ ) { t = y + i; if (t < 0) continue; if (t >= sizeY) break; for (int j = -radius; j <= radius; j++) { t = x + j; if (t < 0) continue; if (t < sizeX) { p = (uchar4) in[Index + i * sizeX + j]; temp = params * convert_float4(p); intensity = (uchar) temp.x + temp.y + temp.z; intensities[intensity].x++; /*if(intensities[intensity].x > maks) { maxIntensity = intensity; maks = intensities[intensity].x; }*/ arrays[intensity] += (uint4) (p); } } } maxIntensity = 0; maks = 0; for (int i = 0; i < 256; i++ ) { if ( intensities.x > maks ) { maxIntensity = (uchar) i; maks = intensities.x; } } uint4 inttemp = (uint4) (intensities[maxIntensity].x, intensities[maxIntensity].x, intensities[maxIntensity].x, 1); out[Index] = (uchar4) (arrays[maxIntensity] / inttemp); }

0 Likes
15 Replies
omkaranathan
Adept I

Please post the complete code(a compilable test case). Its easy to track down and fix the problem that way.

0 Likes

I posted complete kernel code. No problem with host code for sure.

But problem updated. I found bug and fixed it. Bu it's just a workaround.

 

Problem was with this line:

uint4 arrays[256];

Without init with zeros code works only for GPU. When i change it to:

uint4 arrays[256] = {0, 0, 0, 0};

code works for cpu, bu not for gp (black output).

 

Now array is zeroed in loop. But this is slowdown. Big problem because code is already slow (only about 8 x faster on gpu than serial code).

__kernel void oil(__global uchar4 * in, __global uchar4 * out, const int radius) { int sizeX = get_global_size(0); int sizeY = get_global_size(1); int x = get_global_id(0); int y = get_global_id(1); uint intensity = 0 , maxIntensity = 0, j1 = 0; uint intensities[256] = {0}; uint4 arrays[256];// = {0, 0, 0, 0}; float4 params = (float4) (0.2125f, 0.7154f, 0.0721f, 0.0f), temp; int t; uchar4 p; uint Index = y * sizeX + x; for(int i = 0; i < 256; i++) { arrays.x = '0'; arrays.y = '0'; arrays.z = '0'; } for (int i = -radius; i <= radius; i++ ) { t = y + i; if (t < 0) continue; if (t >= sizeY) break; for (int j = -radius; j <= radius; j++) { t = x + j; if (t < 0) continue; if (t < sizeX) { p = (uchar4) in[Index + i * sizeX + j]; temp = params * (float4) (p); intensity = temp.x + temp.y + temp.z; intensities[intensity]++; int q = (isgreater(intensities[intensity], j1)); if (q == 1) { maxIntensity = intensity; j1 = intensities[intensity]; } arrays[intensity] += (uint4) (p); } } } uint4 inttemp = (uint4) (intensities[maxIntensity], intensities[maxIntensity], intensities[maxIntensity], 1); out[Index] = convert_uchar4_sat_rte(arrays[maxIntensity] / inttemp); }

0 Likes

Outside of memory issues, it looks like you have a ton of control flow and a ton of dependency. Both of which are going to impede performance.

0 Likes

So how to optimize this algorithm?

 

I can change for... loops to if..., but only for radius = 3. raduis = 4 and i have 81 if...  (!). Stream KernelAnalyzer 1.5 says it would be much worse. compared to serial code it's 8 - 14 x faster (depends on image size and kernel workgroup size).

0 Likes

Can you put your local arrays into local memory instead of using scratch memory, something like this, does this work?

It would be easier to help if you posted all the code.

I'm not sure if this will verfiy, but you get much better results in SKA.

BTW, which card are you using? 48xx or 58xx. I'm asking because of the local memory.

__kernel void oil(__global uchar4 * in, __global uchar4 * out, const int radius) { int sizeX = get_global_size(0); int sizeY = get_global_size(1); int x = get_global_id(0); int y = get_global_id(1); uint intensity = 0 , maxIntensity = 0, j1 = 0; __local uint intensities[256];// = {0}; __local uint4 arrays[256];// = {0, 0, 0, 0}; float4 params = (float4) (0.2125f, 0.7154f, 0.0721f, 0.0f), temp; int t; uchar4 p; uint Index = y * sizeX + x; for(int i = 0; i < 256; i++) { arrays.x = '0'; arrays.y = '0'; arrays.z = '0'; intensities = '0'; } for (int i = -radius; i <= radius; i++ ) { t = y + i; if (t < 0) continue; if (t >= sizeY) break; for (int j = -radius; j <= radius; j++) { t = x + j; if (t < 0) continue; if (t < sizeX) { p = (uchar4) in[Index + i * sizeX + j]; temp = params * (float4) (p); intensity = temp.x + temp.y + temp.z; intensities[intensity]++; int q = (isgreater(intensities[intensity], j1)); if (q == 1) { maxIntensity = intensity; j1 = intensities[intensity]; } arrays[intensity] += (uint4) (p); } } } uint4 inttemp = (uint4) (intensities[maxIntensity], intensities[maxIntensity], intensities[maxIntensity], 1); out[Index] = convert_uchar4_sat_rte(arrays[maxIntensity] / inttemp); }

0 Likes

I've got 4850, so can't use __local.

 

here is my project:

http://empol.x.pl/vid/mgr.rar

0 Likes

Well then, the only obvious things I can suggest are to reduce control flow and try doing some unrolling if possible.

0 Likes

Unrolled kernel is about 23% sloer for radius 5. As i thought. Unrolled but much more if's.

Can't remove if's because in this algorithm every pixel of mask can be outside of input image boundary - incorrect result and memory acces error.

But maybe anyone know universal way to zero arrays for GPU and CPU without loop. It would be a little faster.

0 Likes

The ifs should be removed by replacing with select(), or simple one statement ifs or software predication.

Though, I doubt that it's going to improve performance much since you are global write bound.

0 Likes

Here is my unrolled code for 1 pixel.

Can you post sample of better code for 1 pixel?

if((y + (-5) >= 0) && (y + (-5) < sizeY) && (x + (-5) >= 0) && (x + (-5) < sizeX)) { p = (uchar4) in[Index + (-5 * sizeX) + ( -5)]; temp = params * (float4) (p); intensity = (uint) (temp.x + temp.y + temp.z); intensities[intensity]++; arrays[intensity] += (uint4) (p); }

0 Likes

1. Why are you doing 1 pixel at a time? Why not 4?

2. AMD's compiler/ISA doesn't allow ALU Packing across control flow. So you can take each statement and apply the select() function to it to theoritically (not sure if it actually will) get better packing. You can find the "select()" function in the OpenCL spec.

0 Likes

1) How? uchar16?

2) I don't get your idea.

 

Can u give me piece of code please?

0 Likes

ryta1203 he should optimize AFTER that it got working.

you should avoid if() and use select function instead because it is translated into special ALU instruction so you get better performance as it do not need switch to ControlFlow clausule which cause performace drop on GPU.

0 Likes
empol
Journeyman III

Originally posted by: nou ryta1203 he should optimize AFTER that it got working.


Code works.

I know select() is better but how use it in this case. Maybe i'll understand it when i see code.

0 Likes

The OpenCL spec shows exactly how to use the select function.

Yes, if it doesn't work you should get it to work first, but like you said, it works.

You could use uchar16, but I would try uchar2 first, then goto uchar4, etc...

the AMD GPU uses a 5-wide VLIW processsor so it's not like you are going to get 16.

Also, since you are using a 48xx and can't use local memory, you are seriously write bound anyways.

 

0 Likes