You need 1KB of LUTs per work-item. That is huge for GPUs.
I would not suggest accepting such a large LDS requirement and trying to run 2 wavefronts on a CU with only 32 work-items. A better way would be to re-architect the algorithm to bring down the LDS requirement. I had a similar issue with an algorithm, and tweaking it gave me >2X performance.
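To put rough numbers on this, here is a small back-of-the-envelope sketch. It assumes the usual figures for a 7970 (64 KB of LDS per CU, at most 32 KB per work-group, 64-wide wavefronts); please check CL_DEVICE_LOCAL_MEM_SIZE on your own device. The function name and the 256-byte case are only illustrations, not something from your code, and my reading of the 1 KB case is two half-filled wavefronts per CU.

```python
# Assumed HD 7970 (GCN) figures; verify on your device before relying on them.
LDS_PER_CU = 64 * 1024         # bytes of LDS per compute unit (assumption)
LDS_PER_GROUP_MAX = 32 * 1024  # max LDS bytes one work-group may use (assumption)
WAVEFRONT_SIZE = 64            # work-items per wavefront (assumption)

def lds_occupancy(bytes_per_item, group_size):
    """Hypothetical helper: occupancy per CU as limited by LDS alone.

    Returns (work_groups_per_cu, work_items_per_cu, lane_utilisation).
    Registers and other limits are ignored.
    """
    group_bytes = bytes_per_item * group_size
    if group_bytes > LDS_PER_GROUP_MAX:
        return 0, 0, 0.0  # this work-group size cannot be launched at all
    groups = LDS_PER_CU // group_bytes
    # Each work-group occupies whole wavefronts, so a partial wavefront wastes lanes.
    wavefronts_per_group = -(-group_size // WAVEFRONT_SIZE)  # ceiling division
    lanes = wavefronts_per_group * WAVEFRONT_SIZE
    return groups, groups * group_size, group_size / lanes

# 1 KB of LUT per work-item, 32 work-items per work-group
print(lds_occupancy(1024, 32))
# Illustrative only: LUT shrunk to 256 bytes per work-item, full 64-wide groups
print(lds_occupancy(256, 64))
```

With 1 KB per work-item, a work-group of 64 does not fit at all under these assumptions, and a work-group of 32 leaves half of every wavefront idle. That is why shrinking or sharing the LUT tends to pay off more than anything else.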
I am not aware of which vendor provides a better 7970 with more overclocking facilities. Maybe someone else can help there.
For maximum overclocking ability you may want to look into the various GHz edition GPUs: