quadboon,
A lot depends on the kernel code and the global worksize you use.
One possible reason is that with int8 you generate just enough wavefronts to keep a 6970 busy. The 6990 has double the number of cores, so you might get more performance by increasing your global worksize or by using int4 instead of int8 (for the same amount of data, int4 gives you twice as many work-items).
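
To make the wavefront argument concrete, here is a rough back-of-the-envelope sketch. It assumes a wavefront of 64 work-items, a kernel where each work-item processes exactly one vector, and compute-unit counts of 24 and 48 chosen purely for illustration (query CL_DEVICE_MAX_COMPUTE_UNITS for the real value on your device). The function name and the problem size are hypothetical.

```python
import math

WAVEFRONT_SIZE = 64  # assumed work-items per wavefront


def wavefronts_per_compute_unit(total_ints, vector_width, compute_units):
    """Estimate how many wavefronts each compute unit receives.

    Assumes one vector (int4, int8, ...) per work-item, so the global
    worksize is total_ints / vector_width.
    """
    work_items = total_ints // vector_width
    wavefronts = math.ceil(work_items / WAVEFRONT_SIZE)
    return wavefronts / compute_units


if __name__ == "__main__":
    total_ints = 1 << 20  # hypothetical problem size
    for width in (8, 4):
        for units in (24, 48):  # assumed compute-unit counts
            load = wavefronts_per_compute_unit(total_ints, width, units)
            print(f"int{width}, {units} compute units: "
                  f"{load:.1f} wavefronts per compute unit")
```

Under these assumptions, doubling the compute units halves the wavefronts available to each one, and switching from int8 to int4 (or doubling the global worksize) brings the per-unit load back to where it was.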