Previously we had the GPU_USE_SYNC_OBJECTS environment variable, but it apparently no longer works. The spinlocks are back in the runtime, so we have the 100% CPU usage problem again and performance drops. Thank you, but I am sticking with 2.4 until that's solved.
bitselect() is still not mapped to BFI_INT. Why?
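To make it clear why this mapping looks like it should be trivial, here is a plain C sketch of the two operations. bitselect_ref follows the OpenCL definition of bitselect(a, b, c). bfi_int_ref is my assumption of what BFI_INT computes, based on my reading of the ISA docs; both function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* OpenCL bitselect(a, b, c): each result bit comes from b where the
 * matching bit of c is 1, and from a where it is 0. */
static uint32_t bitselect_ref(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Assumed BFI_INT semantics (not taken from this post): the first
 * operand is the mask, picking bits from x where set, else from y. */
static uint32_t bfi_int_ref(uint32_t mask, uint32_t x, uint32_t y)
{
    return (mask & x) | (~mask & y);
}

/* Under that assumption, bitselect(a, b, c) is BFI_INT(c, b, a). */
static void check_mapping(uint32_t a, uint32_t b, uint32_t c)
{
    assert(bitselect_ref(a, b, c) == bfi_int_ref(c, b, a));
}
```

If that reading of BFI_INT is right, the two differ only in operand order, so one bitselect() should be one instruction.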
The BFE_UINT optimization (which is mentioned in the docs) is for some reason slower when it operates on values from __local memory: additional MOV instructions are generated, and MOV+BFE is slower than LSHR+AND. As a result, some of my kernels are now slower.
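For anyone not familiar with it, this is the kind of source pattern I mean, written as plain C rather than kernel code. The function name is hypothetical, and the note about which instructions come out is my understanding of the compiler's behaviour, not something I can show from a disassembly here.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits of x starting at bit `offset` (offset < 32).
 * Written as shift + mask, this is the LSHR+AND form; the compiler
 * may instead fold it into a single bit-field extract (BFE_UINT). */
static uint32_t extract_shift_mask(uint32_t x, unsigned offset, unsigned width)
{
    /* Guard the width == 32 case: shifting a 32-bit value by 32 is undefined in C. */
    uint32_t mask = (width >= 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (x >> offset) & mask;
}

/* Small self-check: pull the second-lowest byte out of a word. */
static void check_extract(void)
{
    assert(extract_shift_mask(0xDEADBEEFu, 8u, 8u) == 0xBEu);
}
```

Both forms give the same result, so the only question is which instruction sequence is cheaper once the operand has to be loaded from __local memory.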
Offline compilation is now broken too.
I am rather disappointed :(