8 Replies Latest reply on Jan 19, 2011 2:48 PM by MicahVillmow

    Fast Reed-Muller Transform in OpenCL

    marblecanyon
      How to work with boolean arrays in OpenCL kernels

      I am working on OpenCL implementions of various fast Kronecker transforms (Walsh-Hadamard, arithmetic, Reed-Muller...). I had no problems when writing kernels for transforms that work with integer arrays (Walsh-Hadamard and arithmetic), but I have the following problem with implementation of Reed-Muller transform, for which I'm asking for someone's help and advice:

      This transform should operate on an array of bool values by doing an XOR operation on certain pairs of bool values, as you can see in the attached code. When I create the first parameter to the kernel as a __global bool array, the transform doesn't return correct values, and when I make this parameter a __global char array I get a 4x decrease in performance, but correct result. My guess is that there is some problem with __global bool arrays in OpenCL, but I'm not sure exactly what.   

      Thank you in advance for your time and help!

       

      // kernel code __kernel void FastReedMullerTransform(__global bool *numarray, __const unsigned int step) { unsigned int tid = get_global_id(0); const unsigned int pair = tid%step + 2*step*(tid/step); const unsigned int match = pair + step; const bool u = numarray[pair]; const bool v = numarray[pair + step]; numarray[match] = u ^ v; } //referent C++ implementation template<class T> void CPUFastReedMullerTransform(T *inputArray, const unsigned int n) { { // for each pass of the algorithm for(cl_uint step=1; step < n; step <<=1) { cl_uint jump = step << 1; for(cl_uint pair = 0; pair < n; pair += jump) { cl_uint t1=pair; cl_uint t2=pair+step; for(cl_uint i = 0; i < step; ++i) { bool u = inputArray[t1]; bool v = inputArray[t2]; inputArray[t2] = u ^ v; ++t1, ++t2; } } } } return; }

        • Fast Reed-Muller Transform in OpenCL
          MicahVillmow
          bool is not a valid kernel argument type.
          See section 6.8.k of the spec.
          "Arguments to __kernel functions in a program cannot be declared with the built-in
          scalar types bool, half, size_t, ptrdiff_t, intptr_t, and uintptr_t. The"
            • Fast Reed-Muller Transform in OpenCL
              himanshu.gautam

              marblecanyon,

              I am not familiar with the access pattern required for Reed Muller Transform, but i think if we are able to assign more work per work item we sometimes can get better performance. Maybe you can use a char to access 8 bools and do 8 transforms inside a single workitem.

              This helps in many situation and may be reed muller can also benifit. 

              • Fast Reed-Muller Transform in OpenCL
                mvse

                 

                Hello,

                 

                I actually use bool as kernel arguments for read and write operations. So I am a little bit surprised that it should not work. Most likely I make something wrong, but the code works and gives correct results. Perhaps the performance is not good.

                 

                I believe to remember that I avoid incorrect results by using sizeof(cl_bool) to determine the size of the buffer in the clCreateBuffer function.

                bool_buffer = clCreateBuffer( context, CL_MEM_READ_WRITE, sizeof(cl_bool) * number_of_items, NULL, &err); //example kernel __kernel void BoolResult( const __global bool *status, const __global double *x1, const __global double *x2, __global double *y) { index = get_global_id(0); if(status[index]) { y[index] = x1[index]; } else { y[index] = x2[index]; } }

              • Fast Reed-Muller Transform in OpenCL
                MicahVillmow
                mvse, I would not rely on this behavior as it violates the spec and can be disabled in the future.
                  • Fast Reed-Muller Transform in OpenCL
                    marblecanyon

                    First, thank you everyone for your answers.

                    @himanshu.gautam: Thank you for your suggestion, I tried it out, but in this case at least, increasing the amount of work per work-item, while lowering the total number of work items, had a negative effect on performance. With increasing the amount of work per work-item 2x, performance dropped 20-30%, and a 4x increse led to 50-60% drop in performance. I'm curious why this happens?

                    @MicahVillmow: How should one then operate, in a manner that conforms to the specs, on arrays that contain boolean values in OpenCL kernels? 

                      • Fast Reed-Muller Transform in OpenCL
                        Meteorhead

                        Hi!

                        I do not know how costly exactly your kernel is. The ony you wrote down contains so little computation, it might not even be worth it to use GPUs for it. To efficiently work on boolean values, you might consider storing your data bitwise. It will require bitwise operations to access your data, but it will speed up memory operations by a lot. I have implemented a physics algorithm that works on boolean values and I stored them in uint4 vectors. One work item processed 128 values and needed only to load 16 bytes. Practically zero memory movement for a whole lot of computation, and shader capacity is plentiful.

                          • Fast Reed-Muller Transform in OpenCL
                            himanshu.gautam

                            hi,

                            meteor head explains that right. You can acheive the best from GPU if you are able to provide enough computational work to all the SIMDs other your performance is bound to decrease.

                            I suggested to combine workitems so that the data fetching for them could be clubbed. You can use profiler or SKA(if using windows) to find out the bottleneck and work accoedingly.

                      • Fast Reed-Muller Transform in OpenCL
                        MicahVillmow
                        marblecanyon,
                        Treat them as integers. That is what the hardware treats them on the GPU.