I'm implementing a GPGPU application in GLSL, following Dominik Göddeke's tutorial basically to the letter. I'm using GL_TEXTURE_RECTANGLE_ARB and framebuffer objects, and my output texture has internal format GL_RGBA32F_ARB. I read the output texture back with glReadPixels into a buffer, using format GL_RGBA and type GL_FLOAT.
I get correct results, but the performance of glReadPixels is miserable. Reading back a 3 megapixel texture (48 megabytes of data) takes 2.1 seconds, which is only about 23 MB/sec. I call glFinish before glReadPixels so that the measured time covers only the readback and not any pending rendering.
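To make the arithmetic behind those figures explicit, here is a minimal C sketch. The helper names and the 2000 x 1500 texture size are assumptions for illustration only (the post just says "3 megapixel"), and 1 MB is taken as 10^6 bytes.

```c
#include <stddef.h>

/* Hypothetical helper: bytes transferred when reading back a
   width x height region with the given number of channels per pixel
   and bytes per channel (4 channels x 4 bytes for GL_RGBA / GL_FLOAT). */
size_t readback_bytes(size_t width, size_t height,
                      size_t channels, size_t bytes_per_channel)
{
    return width * height * channels * bytes_per_channel;
}

/* Hypothetical helper: throughput in MB/s, with 1 MB = 10^6 bytes. */
double throughput_mb_per_s(size_t bytes, double seconds)
{
    return (double)bytes / 1.0e6 / seconds;
}
```

With an assumed 2000 x 1500 RGBA float texture, readback_bytes(2000, 1500, 4, 4) gives 48,000,000 bytes, and throughput_mb_per_s(48000000, 2.1) comes out at roughly 22.9 MB/s.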
I'm using an X1650 Pro card on Linux with the 8.54.3 drivers. This is an AGP card. The exact same program running on an older Nvidia 6200 AGP card completes the readback extremely quickly, in a tiny fraction of a second. I have tried using GL_BGRA instead of GL_RGBA, to no effect, as well as using 8-byte aligned buffers and setting GL_PACK_ALIGNMENT and GL_UNPACK_ALIGNMENT accordingly. I have also tried reading back just one color channel, but the time is always the same, about 2.1 seconds.
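As a side note on the alignment experiment, only GL_PACK_ALIGNMENT applies to glReadPixels (GL_UNPACK_ALIGNMENT governs uploads). The sketch below shows how the row stride in client memory follows from the pack alignment, assuming GL_PACK_ROW_LENGTH is left at 0. The function name is hypothetical, not part of OpenGL.

```c
#include <stddef.h>

/* Hypothetical helper: bytes per row written by glReadPixels for a given
   GL_PACK_ALIGNMENT (1, 2, 4 or 8), assuming GL_PACK_ROW_LENGTH is 0.
   Each row is padded up to the next multiple of the alignment. */
size_t packed_row_stride(size_t width, size_t channels,
                         size_t bytes_per_channel, size_t pack_alignment)
{
    size_t row = width * channels * bytes_per_channel;
    return (row + pack_alignment - 1) / pack_alignment * pack_alignment;
}
```

For GL_RGBA with GL_FLOAT every row is 16 * width bytes, which is already a multiple of 8, so any legal pack alignment yields the same layout. That is consistent with the alignment changes making no difference here. Only a single-channel float readback of odd width would be padded under an alignment of 8.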
I've spent a lot of time searching for information on this problem, and it seems to be a common issue, but there is very little hard information on the cause or a solution. Some blame the drivers. Others blame the hardware, suggesting that fast readback is only provided on high-end workstation video cards. These do not sound like reasonable explanations to me, and the posts making them are all about two years old anyway. The ATI OpenGL Programming and Optimization Guide suggests that glReadPixels is accelerated (page 13, Pixel Transfer Operations), but does not hint at what the expected performance should be.
I'd like to get the straight dope. Is this a driver issue, a hardware issue, or is there some more OpenGL magic that I need to do to improve the readback performance?