3 Replies Latest reply on Oct 27, 2011 11:56 AM by Meteorhead

    2D work-groups



      I'm currently experimenting with work-group sizes and use the following:


      globalWork = cl::NDRange(3840, 4);
      localWork = cl::NDRange(32, 4);

      Is it then correct to use the following to compute the current work-item's index into a sequential (flat) array?


      int row = get_global_id(1);
      int col = get_global_id(0);
      size_t gid = row * get_global_size(0) + col;

        • 2D work-groups

          Yes, it is correct. Your naming convention does seem funny to me, though: I would expect one row of any dataset to share the same Y coordinate and one column to share the same X coordinate throughout the dataset. That is only a matter of taste, however. It also has nothing to do with whether the data is stored row-major or column-major; it only concerns how you interpret the data.
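To make that check concrete, here is a minimal host-side sketch in plain C++ (not kernel code) that replays the index arithmetic from the question. The names `linearIndex` and `coversEveryElementOnce` are made up for this illustration; inside a kernel the arguments would come from `get_global_id` and `get_global_size`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same arithmetic as in the kernel: row * get_global_size(0) + col.
// (hypothetical helper name, used only for this illustration)
std::size_t linearIndex(std::size_t col, std::size_t row, std::size_t globalSizeX) {
    assert(col < globalSizeX);
    return row * globalSizeX + col;
}

// Emulates every work-item of a globalSizeX x globalSizeY range and checks
// that each element of the flat array is addressed exactly once.
bool coversEveryElementOnce(std::size_t globalSizeX, std::size_t globalSizeY) {
    std::vector<int> hits(globalSizeX * globalSizeY, 0);
    for (std::size_t row = 0; row < globalSizeY; ++row)
        for (std::size_t col = 0; col < globalSizeX; ++col)
            ++hits[linearIndex(col, row, globalSizeX)];
    for (int h : hits)
        if (h != 1) return false;
    return true;
}
```

Calling `coversEveryElementOnce(3840, 4)` walks the global range from the question and confirms that the mapping neither skips nor repeats an array element.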

            • 2D work-groups


              In fact, I am having trouble translating some CUDA code to OpenCL:


              int tasksCount = 489984;
              int desiredWarps = (tasksCount + 31) / 32;
              Vec2i blockSize(32, 4);
              int blockWarps = (blockSize.x * blockSize.y + 31) / 32;
              Vec2i gridSize((desiredWarps + blockWarps - 1) / blockWarps, 1);

              return module->launchKernelTimed(kernel, blockSize, gridSize);


              This is what I have in OpenCL:


              int tasksCount = 489984;
              int blockWidth = 32;
              int blockHeight = 4;
              int desiredWarps = (tasksCount + 31) / 32;
              localWork = cl::NDRange(blockWidth, blockHeight);
              int blockWarps = (blockWidth * blockHeight + 31) / 32;
              int bWH = blockWidth * blockHeight;

              int gridX = (desiredWarps + blockWarps - 1) / blockWarps;
              //if (gridX % bWH != 0) gridX = (gridX / bWH + 1) * bWH;
              if (gridX % blockWidth != 0) gridX = (gridX / blockWidth + 1) * blockWidth;
              globalWork = cl::NDRange(gridX, blockHeight);

                • 2D work-groups

                  Forgive me, but all the transistors in my brain are too burnt to fully understand what this code wishes to do. It seems to me, however, that all this calculation to arrive at the desired work sizes is awfully redundant.

                  The thing is that in CUDA you set the actual block size and then say how many blocks of that size you need to cover your entire dataset. In OpenCL such coverage is easier to achieve: you only declare the size of your dataset (which is the global size itself if you need one thread per piece of data), and you specify either the maximum available local size or something lower if desired. Inside the kernel there are then many useful built-in functions to query the number of work-groups, the work-group size, the global size and so on (for example get_num_groups, get_local_size and get_global_size).
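As a rough illustration of that difference, the sketch below (plain C++, no OpenCL headers) describes the same launch in both styles. The struct `Size2` and all three function names are hypothetical. The sketch relies only on two facts: a CUDA grid counts blocks, while an OpenCL global size counts work-items, and under OpenCL 1.x the global size must be a multiple of the local size in each dimension.

```cpp
#include <cassert>
#include <cstddef>

struct Size2 { std::size_t x, y; };  // hypothetical stand-in for a 2D size

// Round value up to the next multiple of `multiple`.
std::size_t roundUp(std::size_t value, std::size_t multiple) {
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// CUDA style: the grid counts BLOCKS, laid out along x only.
Size2 cudaGridFor(std::size_t tasks, Size2 block) {
    std::size_t threadsPerBlock = block.x * block.y;
    return { (tasks + threadsPerBlock - 1) / threadsPerBlock, 1 };
}

// OpenCL style: the global size counts WORK-ITEMS. Pad the task count to a
// whole number of work-groups, then keep a single row of work-groups in y.
Size2 openclGlobalFor(std::size_t tasks, Size2 local) {
    std::size_t padded = roundUp(tasks, local.x * local.y);
    return { padded / local.y, local.y };
}
```

With the numbers from this thread (489984 tasks, a 32 x 4 local size) this gives 3828 CUDA blocks but an OpenCL global size of 122496 x 4. That is 32 times more work-items than the 3840 x 4 range posted above, so that range only matches the CUDA launch if each work-item is meant to process several tasks.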

                  When porting CUDA code, you might want to consider rethinking desiredWarps and the related variables so that you do not go crazy. I have ported CUDA code myself, and in the long run it is easier to switch to indexing that fits OpenCL than to keep CUDA terminology all the way.

                  To shift to OpenCL-style indexing, you need to rewrite the corresponding parts of the kernel (mainly just the init part), and doing so makes your job easier.
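For the kernel side, the correspondence between the usual CUDA index expression and the OpenCL built-ins can be sketched as follows. This is a host-side C++ emulation, not kernel code: the struct `WorkItem` and its field names are invented here to stand in for `get_group_id(0)`, `get_local_size(0)` and `get_local_id(0)` (CUDA's `blockIdx.x`, `blockDim.x` and `threadIdx.x`).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for one dimension of a work-item's position.
struct WorkItem {
    std::size_t groupId;    // get_group_id(0)   / blockIdx.x
    std::size_t localSize;  // get_local_size(0) / blockDim.x
    std::size_t localId;    // get_local_id(0)   / threadIdx.x
};

// CUDA-style expression: blockIdx.x * blockDim.x + threadIdx.x.
// This equals get_global_id(0) when no global offset is used.
std::size_t globalId(const WorkItem& w) {
    assert(w.localId < w.localSize);
    return w.groupId * w.localSize + w.localId;
}

// Inverse direction: split a global id back into group and local parts.
WorkItem fromGlobalId(std::size_t gid, std::size_t localSize) {
    assert(localSize > 0);
    return { gid / localSize, localSize, gid % localSize };
}
```

In an OpenCL kernel the whole expression collapses to a single `get_global_id(0)` call, which is one reason the init part tends to get shorter after porting.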