If I run my kernel with a global work size of 524289 instead of 524288 (which is an exact power of two), performance drops 5x. The size of the allocated memory buffers does not change, so this can't be an alignment issue.
Why such a big penalty? Note that sizes 524290 and 524291 work fine.