I'm making a rasterizer where each work group processes a 32x32 tile. To maximize IO speed, each work group keeps its tile in a buffer in local memory, then copies it back to the image in main memory when it's done. Unfortunately I can't use the async copy functions, because I need the global buffer to be of type image_t so that I can hand it over to OpenGL once the processing is complete.
What's the fastest way to blit my tile into the output image? Should I have a single SPU (say SPU1) loop through all the pixels and copy them in, should I have all 16 SPUs do an interleaved copy, or should they do a non-interleaved copy? Or is there some way to use the async copy functions with images?
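To make the three options concrete, here is a minimal sketch in plain C of the index mappings I have in mind. It is not my actual kernel: the function names, the `uint32_t` pixel format, and the flat `image` array are assumptions made purely for illustration. In the real kernel, `lid` would presumably come from `get_local_id(0)`, and the store would be an image write rather than an array assignment.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    TILE_W = 32,
    TILE_H = 32,
    TILE_PIXELS = TILE_W * TILE_H, /* 1024 pixels per tile */
    NUM_WORKERS = 16               /* the 16 SPUs in one work group */
};

/* Copy tile pixel i to its place in the image. (x0, y0) is the tile's
 * top-left corner in the image; image_w is the image width in pixels.
 * Hypothetical stand-in for the real image write. */
static void store_pixel(const uint32_t *tile, uint32_t *image,
                        size_t image_w, size_t x0, size_t y0, size_t i)
{
    size_t x = x0 + i % TILE_W;
    size_t y = y0 + i / TILE_W;
    image[y * image_w + x] = tile[i];
}

/* Option 1: one worker copies all 1024 pixels by itself. */
void copy_single(const uint32_t *tile, uint32_t *image,
                 size_t image_w, size_t x0, size_t y0)
{
    for (size_t i = 0; i < TILE_PIXELS; ++i)
        store_pixel(tile, image, image_w, x0, y0, i);
}

/* Option 2: interleaved. Worker lid copies pixels
 * lid, lid + 16, lid + 32, ... so neighbouring workers
 * touch neighbouring pixels on every iteration. */
void copy_interleaved(const uint32_t *tile, uint32_t *image,
                      size_t image_w, size_t x0, size_t y0, size_t lid)
{
    for (size_t i = lid; i < TILE_PIXELS; i += NUM_WORKERS)
        store_pixel(tile, image, image_w, x0, y0, i);
}

/* Option 3: non-interleaved. Worker lid copies one contiguous run of
 * 1024 / 16 = 64 pixels, i.e. two full rows of the tile. */
void copy_blocked(const uint32_t *tile, uint32_t *image,
                  size_t image_w, size_t x0, size_t y0, size_t lid)
{
    size_t per_worker = TILE_PIXELS / NUM_WORKERS;
    size_t begin = lid * per_worker;
    for (size_t i = begin; i < begin + per_worker; ++i)
        store_pixel(tile, image, image_w, x0, y0, i);
}
```

All three produce the same image once every worker has run (`lid` from 0 to 15 for options 2 and 3); the only difference is which SPU writes which pixel, and that is the part whose performance I'm unsure about.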