I was trying out OpenCL with a small benchmark application I wrote. The kernel is a simple numerical calculation on unsigned ints, and clEnqueueNDRangeKernel is called with a one-dimensional work size. There's nothing unusual going on in the benchmark and no cooperation between threads: each thread simply reads 3 values from input arrays (using "get_global_id (0)" as the index), calculates something from them, and stores the result in an output array at the same index.
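To make the access pattern concrete, here is a rough plain C model of what each work-item does. It is only a sketch: the names work_item and run_all are made up for illustration, gid stands in for "get_global_id (0)", and the arithmetic is a placeholder because the real calculation is not shown here.

```c
#include <stddef.h>

/* Plain C model of one work-item; gid stands in for get_global_id(0). */
static void work_item(size_t gid,
                      const unsigned int *a,
                      const unsigned int *b,
                      const unsigned int *c,
                      unsigned int *out)
{
    /* Placeholder arithmetic: the real calculation is not part of this post. */
    out[gid] = a[gid] * b[gid] + c[gid];
}

/* Serial stand-in for a one-dimensional range of work_size work-items. */
static void run_all(size_t work_size,
                    const unsigned int *a,
                    const unsigned int *b,
                    const unsigned int *c,
                    unsigned int *out)
{
    for (size_t gid = 0; gid < work_size; ++gid)
        work_item(gid, a, b, c, out);
}
```

Every index is read and written independently, so in this model the cost should grow roughly linearly with work_size.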
Very strangely, the time taken to finish 10 iterations of this work changes drastically when the work size is merely changed from 1,000,000 to 1,194,877 (this happens on both the CPU and the GPU).
Here are the benchmark results for the GPU (HD 5750):
work_size = 1000000, time taken = 4296 ms
work_size = 1194877, time taken = 36612 ms
work_size = 2000000, time taken = 7913 ms
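Normalizing these numbers per work-item makes the gap easier to see. The helper below is a minimal sketch (the name us_per_item is made up for illustration); it converts a total time in milliseconds into microseconds per work-item, summed over all 10 iterations.

```c
/* Average cost per work-item in microseconds, given a total time in ms. */
static double us_per_item(double total_ms, double work_size)
{
    return total_ms * 1000.0 / work_size;
}
```

Applied to the three results above, this gives roughly 4.30, 30.64 and 3.96 microseconds per work-item, so the odd size costs about seven times more per item than either round size.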
As you can see, there's a huge slowdown when using that particular number as the input size. A bigger (but rounder) size actually runs much faster than this odd one, which means it would be faster to pad the arrays with garbage just to make the input bigger, even though we want to calculate fewer values.
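The padding workaround described above could be sketched like this. It rounds the work size up to the next multiple of some block size; the value 64 used in the note below is purely an assumption for illustration, not something measured on this hardware, and round_up is a made-up helper name.

```c
#include <stddef.h>

/* Round n up to the next multiple of `multiple` (which must be > 0). */
static size_t round_up(size_t n, size_t multiple)
{
    size_t rem = n % multiple;
    return rem == 0 ? n : n + (multiple - rem);
}
```

With an assumed block size of 64, round_up(1194877, 64) gives 1194880, so only 3 extra items would be needed. The kernel would then have to skip any index at or beyond the real size (or the arrays would have to be allocated at the padded length) so that the extra work-items do not access memory out of bounds.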
The OpenCL SDK must be doing something strange here if it can't run non-round input sizes efficiently.