Does really nobody have an answer, opinion, or comment on this?
I would still be interested...
Memory access latency is hidden by switching the ALUs to another hardware thread while the first one waits on memory. You have declared that there is only a single hardware thread for the ALUs to work on, so there is nothing to switch to and you will get no latency hiding.
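To make the connection between LDS usage and latency hiding concrete, here is a rough back-of-the-envelope sketch. The function name, the 64 KiB of LDS per compute unit, and the cap of 16 resident work-groups are all assumptions for illustration, not figures from this thread; check the numbers for your actual device. It also ignores the other limits (registers, work-group size) that can reduce occupancy further.

```python
def resident_workgroups(lds_per_workgroup, lds_per_cu=64 * 1024, hw_limit=16):
    """Estimate how many work-groups fit on one compute unit, by LDS alone.

    lds_per_cu and hw_limit are placeholder values (assumptions), not
    numbers taken from any particular device.
    """
    if lds_per_workgroup < 0 or lds_per_workgroup > lds_per_cu:
        raise ValueError("LDS request does not fit on one compute unit")
    if lds_per_workgroup == 0:
        # No LDS used, so only the (assumed) hardware cap applies.
        return hw_limit
    # Each resident work-group keeps its full LDS allocation for its lifetime.
    return min(hw_limit, lds_per_cu // lds_per_workgroup)


if __name__ == "__main__":
    for request in (64 * 1024, 16 * 1024, 0):
        print(request, "bytes of LDS ->", resident_workgroups(request), "work-group(s)")
```

Under these assumed numbers, a kernel that claims the whole LDS leaves room for exactly one work-group per compute unit, so the scheduler has little or nothing else to run while that work-group waits on memory.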
It's not possible to define a kernel so that it "releases" its LDS allocation partway through; the allocation is held for as long as the kernel is running. You need to start a new kernel (with no LDS allocation), which means the first kernel has to save its data to global memory so that the second kernel can see it.