I have an OpenCL program that uses 54 registers per thread. It runs 3x slower on a Radeon HD 5870 compared with an NVIDIA GTX 470 using similar configurations.
I heard that the 5870 only allows ~30 registers per thread and that the rest are spilled to global memory. Is this true? Is there anything I can optimize?