You are right that PCIe communication can significantly hurt performance, and in cases like the one you mention, where only a portion of the physics (namely collision detection) runs on the GPU, the transfer overhead relative to the amount of work increases.
A possible solution to your problem would be to compute all aspects of the physics on the GPU. Particle-like physics systems can all be parallelized over the number of objects, including collision detection. This way you can reduce the incoming transfer to avatar input and the outgoing transfer to the new model coordinates, which will hopefully fit into the PCIe buffers. The biggest problems arise when GPU physics has to handle destruction of the environment: deforming existing objects, spawning new ones, and breaking existing ones (which is deforming plus spawning). Implementing such capabilities in a GPU-accelerated game engine would be revolutionary.
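To make the "parallelize over the number of objects" idea concrete, here is a minimal CPU-side sketch in Python, not a real GPU implementation. The names `Particle`, `step_particle` and `step` are made up for illustration, and the equal-mass elastic response in 2D is an assumption chosen for brevity. The point is that each particle's new state depends only on the old state of the scene, so `step_particle` could serve as the body of a kernel with one thread per object.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Particle:
    x: float
    y: float
    vx: float
    vy: float
    radius: float

def step_particle(i, old, dt):
    """Compute the next state of particle i, reading only the old state.

    On a GPU this would be the per-thread kernel body (one thread per object).
    Collision detection is brute force: each particle tests all the others.
    """
    p = old[i]
    vx, vy = p.vx, p.vy
    for j, q in enumerate(old):
        if j == i:
            continue
        dx, dy = p.x - q.x, p.y - q.y
        dist2 = dx * dx + dy * dy
        reach = p.radius + q.radius
        if 0.0 < dist2 < reach * reach:
            dist = math.sqrt(dist2)
            nx, ny = dx / dist, dy / dist  # contact normal, pointing from q to p
            approach = (p.vx - q.vx) * nx + (p.vy - q.vy) * ny
            if approach < 0.0:
                # Assumed response: equal masses, perfectly elastic contact.
                vx -= approach * nx
                vy -= approach * ny
    return Particle(p.x + vx * dt, p.y + vy * dt, vx, vy, p.radius)

def step(old, dt):
    """Double-buffered update: every output is built from the old buffer only,
    so the iterations are independent and could run in parallel."""
    return [step_particle(i, old, dt) for i in range(len(old))]
```

With everything resident on the device, the host would only send avatar input each frame and read back the new coordinates, for example `[(p.x, p.y) for p in step(scene, dt)]`.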
Unfortunately, crafting such an engine requires serious GPGPU expertise to know where one can cut memory bandwidth usage efficiently without wasting shader capacity, especially if you're looking to make something better than the usual MMO engines limited to some 32 players.
Most likely the best solution to your problem would be AMD's new-generation Llano processors: a quad-core CPU with an extremely potent and programmable IGP. The best part of the APU setup is that it removes exactly what causes the biggest problem in your case, since PCIe communication is eliminated altogether. Mapping and unmapping of objects between host and device will most likely be just pointer passing between the IGP and the CPU.
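As a loose analogy only (plain Python, no GPU API involved), the difference between mapping by copying and mapping by pointer passing can be sketched like this. `device_buffer`, `map_copy` and `map_zero_copy` are hypothetical names, and whether a given APU driver really behaves like the zero-copy case is an assumption.

```python
import array

# Stand-in for a buffer of model coordinates living in "device" memory.
device_buffer = array.array("f", [0.0, 1.0, 2.0])

def map_copy(buf):
    """Discrete-card style: mapping moves the data, so the host gets its own copy."""
    return array.array("f", buf)

def map_zero_copy(buf):
    """Shared-memory style (assumed for an APU): mapping hands back a view of
    the same memory, so no bytes are transferred."""
    return memoryview(buf)
```

Writing through the object returned by `map_zero_copy` changes `device_buffer` directly, while changes to the result of `map_copy` stay local until they are copied back.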
However, if you can create an engine that runs well on a discrete graphics card, then game servers can scale up practically indefinitely.