15 Replies Latest reply on May 19, 2010 5:28 PM by ryta1203

    Oil painting kernel on CPU - rubbish output

    empol

      I wrote a kernel for an oil painting filter. It works on the GPU, but on the CPU I get messed-up output. I can see shapes and colors, but the image is very noisy.

       

       

      __kernel void oil(__global uchar4 * in, __global uchar4 * out, const int brushsize)
      {
          const int radius = brushsize >> 1;
          int sizeX = get_global_size(0);
          int sizeY = get_global_size(1);
          int x = get_global_id(0);
          int y = get_global_id(1);
          uchar intensity, maxIntensity = 0, maks = 0;
          uchar4 intensities[256];
          uint4 arrays[256];
          float4 params = (float4) (0.2125f, 0.7154f, 0.0721f, 0.0f), temp;
          int t;
          uchar4 p;
          uint Index = y * sizeX + x;

          intensities[0].x = '\0';
          arrays[0] = '\0';

          for (int i = -radius; i <= radius; i++ ) {
              t = y + i;
              if (t < 0) continue;
              if (t >= sizeY) break;
              for (int j = -radius; j <= radius; j++) {
                  t = x + j;
                  if (t < 0) continue;
                  if (t < sizeX) {
                      p = (uchar4) in[Index + i * sizeX + j];
                      temp = params * convert_float4(p);
                      intensity = (uchar) temp.x + temp.y + temp.z;
                      intensities[intensity].x++;
                      /*if(intensities[intensity].x > maks) {
                          maxIntensity = intensity;
                          maks = intensities[intensity].x;
                      }*/
                      arrays[intensity] += (uint4) (p);
                  }
              }
          }

          maxIntensity = 0;
          maks = 0;
          for (int i = 0; i < 256; i++ ) {
              if ( intensities[i].x > maks ) {
                  maxIntensity = (uchar) i;
                  maks = intensities[i].x;
              }
          }

          uint4 inttemp = (uint4) (intensities[maxIntensity].x, intensities[maxIntensity].x, intensities[maxIntensity].x, 1);
          out[Index] = (uchar4) (arrays[maxIntensity] / inttemp);
      }
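For comparison, here is a minimal plain C sketch of what the kernel above appears to compute for one output pixel: a histogram of luminance values over the brush window, then the average colour of the most frequent bin. It is not the poster's serial code, which the thread does not show. The names `pixel4`, `luminance` and `oil_pixel` are hypothetical, and keeping the centre pixel's alpha is an assumption of this sketch. Two things differ from the kernel on purpose: every bin is cleared before use (the kernel only writes element 0 of `intensities` and `arrays`), and the cast is applied to the whole luminance sum (in `(uchar) temp.x + temp.y + temp.z` the cast binds to `temp.x` only).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical RGBA pixel type standing in for OpenCL's uchar4. */
typedef struct { uint8_t x, y, z, w; } pixel4;

/* Luminance bin of a pixel; the cast covers the whole weighted sum. */
uint8_t luminance(pixel4 p)
{
    float sum = 0.2125f * p.x + 0.7154f * p.y + 0.0721f * p.z;
    return (uint8_t)sum;
}

/* Oil filter result for the pixel at (x, y) of a width*height image. */
pixel4 oil_pixel(const pixel4 *in, int width, int height,
                 int x, int y, int radius)
{
    unsigned count[256];   /* how many window pixels fell into each bin */
    unsigned sum[256][3];  /* per-bin running sums of R, G and B */
    pixel4 out;
    int best = 0;

    assert(radius >= 0 && x >= 0 && x < width && y >= 0 && y < height);
    memset(count, 0, sizeof count);  /* every bin starts at zero */
    memset(sum, 0, sizeof sum);

    for (int i = -radius; i <= radius; i++) {
        int yy = y + i;
        if (yy < 0 || yy >= height) continue;
        for (int j = -radius; j <= radius; j++) {
            int xx = x + j;
            if (xx < 0 || xx >= width) continue;
            pixel4 p = in[yy * width + xx];
            uint8_t k = luminance(p);
            count[k]++;
            sum[k][0] += p.x;
            sum[k][1] += p.y;
            sum[k][2] += p.z;
        }
    }

    /* Most frequent bin; the lowest index wins on ties. */
    for (int k = 1; k < 256; k++)
        if (count[k] > count[best]) best = k;

    /* The centre pixel is always in the window, so count[best] >= 1. */
    out.x = (uint8_t)(sum[best][0] / count[best]);
    out.y = (uint8_t)(sum[best][1] / count[best]);
    out.z = (uint8_t)(sum[best][2] / count[best]);
    out.w = in[y * width + x].w;  /* assumption: alpha is passed through */
    return out;
}
```

Calling `oil_pixel` once per pixel gives a reference image to diff against the CPU and GPU device output.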

        • Oil painting kernel on CPU - rubbish output
          omkaranathan

          Please post the complete code (a compilable test case). It's easier to track down and fix the problem that way.

            • Oil painting kernel on CPU - rubbish output
              empol

              I posted the complete kernel code. There is no problem with the host code, for sure.

              But the problem has changed. I found the bug and fixed it, but it's just a workaround.

               

              The problem was with this line:

              uint4 arrays[256];

              Without initializing it to zeros, the code works only on the GPU. When I change it to:

              uint4 arrays[256] = {0, 0, 0, 0};

              the code works on the CPU, but not on the GPU (black output).

               

              Now the array is zeroed in a loop, but this slows things down. That is a big problem because the code is already slow (only about 8x faster on the GPU than the serial code).

              __kernel void oil(__global uchar4 * in, __global uchar4 * out, const int radius)
              {
                  int sizeX = get_global_size(0);
                  int sizeY = get_global_size(1);
                  int x = get_global_id(0);
                  int y = get_global_id(1);
                  uint intensity = 0 , maxIntensity = 0, j1 = 0;
                  uint intensities[256] = {0};
                  uint4 arrays[256];// = {0, 0, 0, 0};
                  float4 params = (float4) (0.2125f, 0.7154f, 0.0721f, 0.0f), temp;
                  int t;
                  uchar4 p;
                  uint Index = y * sizeX + x;

                  for(int i = 0; i < 256; i++) {
                      arrays[i].x = '0';
                      arrays[i].y = '0';
                      arrays[i].z = '0';
                  }

                  for (int i = -radius; i <= radius; i++ ) {
                      t = y + i;
                      if (t < 0) continue;
                      if (t >= sizeY) break;
                      for (int j = -radius; j <= radius; j++) {
                          t = x + j;
                          if (t < 0) continue;
                          if (t < sizeX) {
                              p = (uchar4) in[Index + i * sizeX + j];
                              temp = params * (float4) (p);
                              intensity = temp.x + temp.y + temp.z;
                              intensities[intensity]++;
                              int q = (isgreater(intensities[intensity], j1));
                              if (q == 1) {
                                  maxIntensity = intensity;
                                  j1 = intensities[intensity];
                              }
                              arrays[intensity] += (uint4) (p);
                          }
                      }
                  }

                  uint4 inttemp = (uint4) (intensities[maxIntensity], intensities[maxIntensity], intensities[maxIntensity], 1);
                  out[Index] = convert_uchar4_sat_rte(arrays[maxIntensity] / inttemp);
              }
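One detail in the zeroing loop above is worth checking: `'0'` is the character constant for the digit zero (value 48 on ASCII systems), not the number 0, and the loop never writes the `w` component. The plain C sketch below uses a hypothetical `uint4_t` struct in place of OpenCL's `uint4`. It clears all four components with numeric zero and shows how a non-zero starting value would skew a bin average. The helper names `clear_bins` and `bin_average` are made up for this illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for OpenCL's uint4. */
typedef struct { unsigned x, y, z, w; } uint4_t;

/* Clear every component of every bin with the number 0, not the char '0'. */
void clear_bins(uint4_t *bins, int n)
{
    for (int i = 0; i < n; i++) {
        bins[i].x = 0;
        bins[i].y = 0;
        bins[i].z = 0;
        bins[i].w = 0;
    }
}

/* Average of `count` samples of value `v` accumulated on top of `start`. */
unsigned bin_average(unsigned start, unsigned v, unsigned count)
{
    assert(count > 0);
    return (start + v * count) / count;
}
```

With four samples of value 100, `bin_average(0, 100, 4)` gives 100, while `bin_average('0', 100, 4)` gives 112 on an ASCII system, so a bin that starts at `'0'` is biased upward, most visibly when few pixels fall into it.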

                • Oil painting kernel on CPU - rubbish output
                  ryta1203

                  Outside of memory issues, it looks like you have a ton of control flow and a ton of dependencies, both of which are going to impede performance.
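One common way to cut some of the control flow is to clamp the window bounds once per pixel instead of testing every neighbour inside the loops. The plain C sketch below compares the two loop shapes by counting the pixels each one visits. The function names are hypothetical, and the thread does not measure whether this helps on the GPU. In OpenCL C the built-in integer `min` and `max` could replace the `imin` and `imax` helpers.

```c
#include <assert.h>

int imin(int a, int b) { return a < b ? a : b; }
int imax(int a, int b) { return a > b ? a : b; }

/* Count window pixels with per-neighbour tests, as the kernel does. */
int window_count_branchy(int x, int y, int w, int h, int radius)
{
    int n = 0;
    for (int i = -radius; i <= radius; i++) {
        int t = y + i;
        if (t < 0) continue;
        if (t >= h) break;
        for (int j = -radius; j <= radius; j++) {
            t = x + j;
            if (t < 0) continue;
            if (t < w) n++;
        }
    }
    return n;
}

/* Same window with the bounds clamped once; no tests inside the loops. */
int window_count_clamped(int x, int y, int w, int h, int radius)
{
    int y0 = imax(y - radius, 0), y1 = imin(y + radius, h - 1);
    int x0 = imax(x - radius, 0), x1 = imin(x + radius, w - 1);
    int n = 0;
    for (int yy = y0; yy <= y1; yy++)
        for (int xx = x0; xx <= x1; xx++)
            n++;
    return n;
}
```

The loop trip counts still differ between edge and interior pixels, so some divergence remains. Only the per-neighbour branches are removed.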

                    • Oil painting kernel on CPU - rubbish output
                      empol

                      So how can I optimize this algorithm?

                       

                      I can change the for loops to if statements, but only for radius = 3. With radius = 4 I already have 81 if statements (!). Stream KernelAnalyzer 1.5 says it would be much worse. Compared to the serial code it's 8-14x faster (depending on image size and kernel work-group size).

                        • Oil painting kernel on CPU - rubbish output
                          ryta1203

                          Can you put your arrays into local memory instead of using scratch memory? Something like this; does it work?

                          It would be easier to help if you posted all the code.

                          I'm not sure if this will verify, but you get much better results in SKA.

                          BTW, which card are you using, 48xx or 58xx? I'm asking because of the local memory.

                          __kernel void oil(__global uchar4 * in, __global uchar4 * out, const int radius)
                          {
                              int sizeX = get_global_size(0);
                              int sizeY = get_global_size(1);
                              int x = get_global_id(0);
                              int y = get_global_id(1);
                              uint intensity = 0 , maxIntensity = 0, j1 = 0;
                              __local uint intensities[256];// = {0};
                              __local uint4 arrays[256];// = {0, 0, 0, 0};
                              float4 params = (float4) (0.2125f, 0.7154f, 0.0721f, 0.0f), temp;
                              int t;
                              uchar4 p;
                              uint Index = y * sizeX + x;

                              for(int i = 0; i < 256; i++) {
                                  arrays[i].x = '0';
                                  arrays[i].y = '0';
                                  arrays[i].z = '0';
                                  intensities[i] = '0';
                              }

                              for (int i = -radius; i <= radius; i++ ) {
                                  t = y + i;
                                  if (t < 0) continue;
                                  if (t >= sizeY) break;
                                  for (int j = -radius; j <= radius; j++) {
                                      t = x + j;
                                      if (t < 0) continue;
                                      if (t < sizeX) {
                                          p = (uchar4) in[Index + i * sizeX + j];
                                          temp = params * (float4) (p);
                                          intensity = temp.x + temp.y + temp.z;
                                          intensities[intensity]++;
                                          int q = (isgreater(intensities[intensity], j1));
                                          if (q == 1) {
                                              maxIntensity = intensity;
                                              j1 = intensities[intensity];
                                          }
                                          arrays[intensity] += (uint4) (p);
                                      }
                                  }
                              }

                              uint4 inttemp = (uint4) (intensities[maxIntensity], intensities[maxIntensity], intensities[maxIntensity], 1);
                              out[Index] = convert_uchar4_sat_rte(arrays[maxIntensity] / inttemp);
                          }
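A caveat on the `__local` version above: in OpenCL, a `__local` variable declared in a kernel is shared by all work-items of a work-group, so every work-item in the group would be updating the same two tables. The plain C sketch below is a sequential simulation with hypothetical names, not a model of the real hardware race. It shows how mixing two work-items' windows into one histogram changes the most frequent bin, compared with one private histogram per work-item.

```c
#include <assert.h>
#include <string.h>

#define BINS 256

/* Add one work-item's window of intensities to a histogram. */
void accumulate(unsigned *hist, const unsigned char *window, int n)
{
    for (int i = 0; i < n; i++)
        hist[window[i]]++;
}

/* Most frequent intensity in a histogram; the lowest index wins on ties. */
int mode_bin(const unsigned *hist)
{
    int best = 0;
    for (int k = 1; k < BINS; k++)
        if (hist[k] > hist[best]) best = k;
    return best;
}

/* Private case: each work-item owns and clears its own histogram. */
int mode_private(const unsigned char *window, int n)
{
    unsigned hist[BINS];
    memset(hist, 0, sizeof hist);
    accumulate(hist, window, n);
    return mode_bin(hist);
}

/* Shared case: two work-items accumulate into the same histogram. */
int mode_shared(const unsigned char *win_a, int na,
                const unsigned char *win_b, int nb)
{
    unsigned hist[BINS];
    memset(hist, 0, sizeof hist);
    accumulate(hist, win_a, na);
    accumulate(hist, win_b, nb);
    return mode_bin(hist);
}
```

For windows `{10, 10, 20}` and `{20, 20, 30}`, the private modes are 10 and 20, but the shared histogram reports 20 for both work-items. Giving each work-item its own slice of local memory would avoid the mixing, at a cost of 256 × 16 + 256 × 4 = 5120 bytes per work-item for tables of this size.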