Forgive me, but all the transistors in my brain are too burnt to fully understand what this code is trying to do. Still, all this calculation to arrive at the desired work sizes looks awfully redundant to me.
The thing is that in CUDA you set the actual block size and then tell the runtime how many blocks of that size you need to cover your entire dataset. In OpenCL it is easier to achieve such coverage: you declare the global size directly (which is simply the size of your dataset if you need one work-item per element), and you specify either the maximum available local size, or something lower if desired (or pass NULL and let the runtime choose). One caveat: in OpenCL 1.x the global size must be a multiple of the local size, so you round it up and guard the tail inside the kernel. Beyond that, there are plenty of built-in functions inside the kernel (get_global_id, get_local_id, get_group_id, get_local_size, get_num_groups, get_global_size) to query the number of work-groups, their sizes, the dataset size, and everything else.
When porting CUDA code, you might want to consider rethinking these desiredWarps and related variables to make your life easier and not go crazy. I have ported CUDA code myself, and in the long run it pays off to shift to indexing that fits OpenCL, rather than keeping CUDA terminology all the way.
To shift to OpenCL indexing, you mostly need to rewrite the corresponding parts of the kernel (mainly just the index-initialization part), and doing so makes the rest of the port much simpler.
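As a sketch of what that index rewrite looks like, here is a hypothetical OpenCL C kernel (a vector add, not your actual kernel) with the CUDA-to-OpenCL mapping spelled out in comments:

```c
/* CUDA init line this replaces:
   int i = blockIdx.x * blockDim.x + threadIdx.x; */

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      const unsigned int n)
{
    size_t i = get_global_id(0);   /* == blockIdx.x * blockDim.x + threadIdx.x */
    if (i < n)                     /* guard: globalSize may be rounded up past n */
        c[i] = a[i] + b[i];

    /* Handy equivalences for the rest of the port:
       get_local_id(0)    ~ threadIdx.x
       get_group_id(0)    ~ blockIdx.x
       get_local_size(0)  ~ blockDim.x
       get_num_groups(0)  ~ gridDim.x
       get_global_size(0) ~ gridDim.x * blockDim.x */
}
```

Once the kernels index themselves with get_global_id, the host side no longer needs any of the desiredWarps arithmetic.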