If your kernel compiles to 33 registers and 32 registers would give you max occupancy, it will usually be beneficial to shift that one extra register to memory reads/writes OR recompute its contents as needed. A lot of kernels that have only some small logic will benefit from the extra occupancy to better hide latency.
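To put rough numbers on why one extra register can matter, here is a minimal back-of-the-envelope sketch. The hardware figures (a 32768-entry register file per compute unit, at most 1024 resident threads, work-groups of 256) are made-up assumptions for illustration, not the specs of any particular GPU, and the model ignores allocation granularity, local memory and every other limit. `resident_threads` is a hypothetical helper, not part of any SDK.

```python
def resident_threads(regs_per_thread, block_size=256,
                     regfile_size=32768, max_threads=1024):
    """Threads that fit on one compute unit in a simplified model.

    All default values are illustrative assumptions, not real device specs.
    """
    # Whole work-groups must fit, so registers are budgeted per work-group.
    regs_per_block = regs_per_thread * block_size
    blocks_by_regs = regfile_size // regs_per_block
    blocks_by_threads = max_threads // block_size
    return min(blocks_by_regs, blocks_by_threads) * block_size


for regs in (32, 33):
    threads = resident_threads(regs)
    print(f"{regs} regs/thread: {threads} resident threads, "
          f"occupancy {threads / 1024:.2f}")
```

Under these assumptions, 32 registers per thread lets four work-groups fit (1024 threads), while 33 registers only leaves room for three (768 threads), so a single register costs a quarter of the occupancy.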
It's a pity that neither AMD nor OpenCL supports this... Is there any update on whether this feature is planned, and if so, when?