That's quite a bit of data and it sounds like you're doing a lot of samples from it (300 samples per shader invocation). I would not expect this to run terribly fast. What is the performance that you're seeing - are we talking 2fps vs. 5fps, or 30fps vs. 50fps?
Have you tried running your application with GPUPerfStudio (http://developer.amd.com/tools/graphics-development/gpu-perfstudio-2/) and seeing if there is anything unusual with the performance counters?
Is there any chance you could share your application with us?
I'm not the owner of the application, but I'll ask whether I can share it.
I tried GPU PerfStudio, but it refuses to collect any numbers from the application. gDEBugger works well, but I don't see anything different from what I see when running on the NVIDIA card.
I'll try to explain further what's going on. My application has 2 stages. The first stage renders 100 images, each 160x120, into a single-channel texture. (I've changed the settings a bit to make this easier to explain; I've checked that this doesn't change the problem.) This stage is also slow, but it's a little complex to explain, so I'll focus on the second stage.
I'm not rendering to the screen. I'm using an FBO that contains 4 textures: one 1600x1200 single-channel float texture, one 1600x1200 4-channel float texture, and two 160x120 single-channel textures, one float and the other unsigned 8-bit. The second stage (it starts after making sure that the first has finished with glFinish) uses the 1600x1200 single-channel texture and the two small ones as input. I compare each of the 100 images to the two small textures and summarize each column into the 1600x1200 4-channel texture (the output). I'm only using one line per image (it's a big waste and I'm planning to change that later). The number of images varies from 1 to 100.
I'm measuring performance on the CPU with RDTSC. With the NVIDIA card, the rendering part takes 0.77 mega cycles and waiting for the render to finish (the wait in glFinish) takes 1186 mega cycles. With the 7850, the render takes 1,576 mega cycles and the glFinish takes 6,421 mega cycles.
The machine with the 7850 has an i7 3770 and the machine with the NVIDIA card has an i7 2600.
I've checked with both the latest release and the latest beta drivers.
One thing I noticed while trying to run GPU PerfStudio is that my application is compiled as 32-bit and runs on a 64-bit OS.
it starts after making sure that the first has finished with glFinish
It should (almost) never be necessary to call glFinish - it will be devastating to performance. Please remove any calls to glFinish from your application and try again.
I'm measuring performance on the CPU with RDTSC,
Don't use RDTSC to measure GPU performance. First, if you're running on a multi-CPU system or a system with any kind of clock throttling, then the results of RDTSC become pretty meaningless. Even if you get a stable result from it, getting a CPU timestamp during rendering only tells you the software overhead of the driver and taking timestamps around glFinish (which you shouldn't be calling anyway) only measures the presentation latency.
If you want to measure GPU time taken to render, use an OpenGL timer query.
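For reference, here is a rough sketch of what a GL_TIME_ELAPSED timer query around the second stage could look like. It is not taken from the application in question. The TimerQueryApi struct, measure_gpu_ns and the draw callback are names made up for this example. The entry points are passed in as function pointers (as you would have them after loading glGenQueries, glBeginQuery, glEndQuery, glGetQueryObjectui64v and glDeleteQueries through GLEW or wglGetProcAddress), and the typedefs and enum values are copied from the standard GL headers only so that the snippet compiles on its own. Timer queries need OpenGL 3.3 or ARB_timer_query.

```cpp
#include <cstdint>

// Same types and enum values as the standard OpenGL headers; repeated here
// only so the sketch is self-contained. In real code, include your GL header.
using GLuint   = unsigned int;
using GLenum   = unsigned int;
using GLsizei  = int;
using GLuint64 = std::uint64_t;
constexpr GLenum GL_TIME_ELAPSED = 0x88BF;
constexpr GLenum GL_QUERY_RESULT = 0x8866;

// Hypothetical table of entry points (an assumption of this sketch). Each
// member has the signature of the GL function of the same name. On 32-bit
// Windows the real pointers also carry the APIENTRY calling convention.
struct TimerQueryApi {
    void (*GenQueries)(GLsizei n, GLuint* ids);
    void (*BeginQuery)(GLenum target, GLuint id);
    void (*EndQuery)(GLenum target);
    void (*GetQueryObjectui64v)(GLuint id, GLenum pname, GLuint64* params);
    void (*DeleteQueries)(GLsizei n, const GLuint* ids);
};

// Returns the GPU time, in nanoseconds, taken by the commands that draw() issues.
GLuint64 measure_gpu_ns(const TimerQueryApi& gl, void (*draw)(void*), void* user)
{
    GLuint query = 0;
    GLuint64 elapsed_ns = 0;

    gl.GenQueries(1, &query);
    gl.BeginQuery(GL_TIME_ELAPSED, query);
    draw(user);                          // issue the draw calls to be timed
    gl.EndQuery(GL_TIME_ELAPSED);

    // Reading GL_QUERY_RESULT waits until the GPU has produced the result.
    // That is fine for a one-off measurement; for continuous profiling, poll
    // GL_QUERY_RESULT_AVAILABLE or read the result a frame or two later.
    gl.GetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed_ns);
    gl.DeleteQueries(1, &query);
    return elapsed_ns;
}
```

The value returned is time spent on the GPU, so it can be compared directly between the two cards without being affected by driver overhead or CPU clock behaviour.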