
Archives Discussions

binarysplit
Journeyman III

Efficiently copying from local buffer to image

Image broken into tiles for local processing. Best way to copy back to the image?

I'm making a rasterizer where each work group processes a 32x32 tile. To maximize IO speed, each work group stores its tile in a local-memory buffer, then copies it back to the image in main memory when it's done. Unfortunately I can't use the async copy functions, because I need the global buffer to be of type image_t so that I can hand it over to OpenGL once the processing is complete.

What's the fastest way to blit my tile into the output image? Should I have a single SPU loop through the pixels and copy them in, or should all 16 SPUs do an interleaved copy, or a non-interleaved copy? Or is there some way to use the async copy functions with images?
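To make the setup concrete, here's roughly the structure I mean - just a sketch with placeholder names like TILE_W and localColorBuf, showing the naive one-work-item copy I'm asking about, and assuming a 2D grid of work groups (one per tile):

#define TILE_W 32
#define TILE_H 32

__kernel void rasterize_tile(write_only image2d_t colorBufImg)
{
    // each work group stages one 32x32 tile in local memory
    __local uint4 localColorBuf[TILE_W * TILE_H];

    /* ... rasterize into localColorBuf with the 16 work items ... */

    barrier(CLK_LOCAL_MEM_FENCE);                 // the tile is finished

    // naive write-back: the first work item copies the whole tile
    if (get_local_id(0) == 0) {
        int tileLeft = get_group_id(0) * TILE_W;  // assumes one group per tile in a 2D NDRange
        int tileTop  = get_group_id(1) * TILE_H;
        for (int i = 0; i < TILE_W * TILE_H; i++)
            write_imageui(colorBufImg,
                          (int2)(tileLeft + i % TILE_W, tileTop + i / TILE_W),
                          localColorBuf[i]);
    }
}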

0 Likes
6 Replies
notzed
Challenger

Originally posted by: binarysplit I'm making a rasterizer where each work group processes a 32x32 tile. To maximize IO speed, each work group stores its tile in a local-memory buffer, then copies it back to the image in main memory when it's done. Unfortunately I can't use the async copy functions, because I need the global buffer to be of type image_t so that I can hand it over to OpenGL once the processing is complete.

What's the fastest way to blit my tile into the output image? Should I have a single SPU loop through the pixels and copy them in, or should all 16 SPUs do an interleaved copy, or a non-interleaved copy? Or is there some way to use the async copy functions with images?
SPU?  Are you in the right place here?  But assuming you just mean work items and not CELL BE processing units ...

You should always avoid a single thread doing anything that can be done in parallel, memory accesses included.  I normally just read/write a whole 16x16 tile at a time using a 16x16 workgroup size.  It's a bit of a trade-off depending on the problem, but 16x16 is often a sweet spot of register usage/local store size/maximum concurrency.
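For example, something along these lines - not code from my actual kernels, just the shape of the pattern, assuming a uint4 image and a 16x16 local size:

// one pixel per work item, 16x16 work group = one 16x16 tile
__kernel void process_tile(read_only image2d_t src, write_only image2d_t dst)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP_TO_EDGE |
                          CLK_FILTER_NEAREST;
    __local uint4 tile[16][16];

    int lx = get_local_id(0), ly = get_local_id(1);
    int gx = get_global_id(0), gy = get_global_id(1);

    // every work item loads exactly one pixel of the tile ...
    tile[ly][lx] = read_imageui(src, smp, (int2)(gx, gy));
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ... process the tile in local memory ... */

    barrier(CLK_LOCAL_MEM_FENCE);

    // ... and writes exactly one pixel back
    write_imageui(dst, (int2)(gx, gy), tile[ly][lx]);
}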

 

0 Likes

Thanks for the feedback. I don't really know what to call the processors that run the work items. The AMD APP documentation calls them "Stream Processors", so I just guessed SPU would be an appropriate acronym. Is there a better acronym, or should I always write out "Stream Processor" or "Work Item"?

I'm using a fixed 16 work items per work group, as my dataset is a tree that needs to be walked in an approximately depth-first way. It's more efficient to do this with a stack than a queue, and even if it were a queue (which would only allow breadth-first walking), I need the work items to be able to create more work items, which would require a rather inefficient multi-pass process. But that's another story. I'll try a few different methods after I've got it up and running.

I'll use an implementation like this for the copy:

// each work item writes every 16th pixel (stride = work-group size)
for (int i = get_local_id(0); i < TILE_W * TILE_H; i += get_local_size(0)) {
    write_imageui(colorBufImg,
                  (int2)(tileLeft + i % TILE_W, tileTop + i / TILE_W),
                  localColorBuf[i]);
}

Thanks again.

0 Likes

Oh, one more thing I was wondering about. You said you just read a 16x16 tile with a 16x16 work group size. Does that mean there's a way to separate the "process data" and "copy cached data from local back to global" phases into multiple kernels? Or do your "process data" kernels run at 16x16 as well?

0 Likes

Originally posted by: binarysplit Thanks for the feedback. I don't really know what to call the processors that run the work items. The AMD APP documentation calls them "Stream Processors", so I just guessed SPU would be an appropriate acronym. Is there a better acronym, or should I always write out "Stream Processor" or "Work Item"?

OK, sorry - I've used a CELL BE, and SPU is what they call their processors. The glossary in the AMD docs says SPU = shader processing unit, but the term is never actually used.

I normally use 'thread' or 'workitem', although 'thread' is easy to confuse with CPU threads, and 'workitem' also refers to all work items, not just the ones within a workgroup.

I'm using a fixed 16 work items per work group, as my dataset is a tree that needs to be walked in an approximately depth-first way. It's more efficient to do this with a stack than a queue, and even if it were a queue (which would only allow breadth-first walking), I need the work items to be able to create more work items, which would require a rather inefficient multi-pass process. But that's another story. I'll try a few different methods after I've got it up and running.

16 threads will use less than 1/4 of the processing power on a gpu.  You want at least 64 threads.

If the 16 really is set in stone, you could simply run 4 lots at once, either as a 32x32 tile or a 64x16 strip. Just make sure each group of 16 works on the right set of data, and that any barriers are hit properly (all threads in the work group have to hit every barrier).
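Something like this for the indexing - just a sketch, the names are made up, and it assumes a 1D local size of 64 with one group per 64x16 strip:

__kernel void rasterize_four_tiles(write_only image2d_t colorBufImg)
{
    int lid  = get_local_id(0);   // 0..63
    int gang = lid / 16;          // which 16-item gang: 0..3
    int lane = lid % 16;          // position within the gang: 0..15

    // four 16x16 tiles laid out side by side as a 64x16 strip
    int tileLeft = get_group_id(0) * 64 + gang * 16;
    int tileTop  = get_group_id(1) * 16;

    /* ... each gang walks its own tree / fills its own tile, indexed by lane ... */

    // barriers are per work group: all 64 work items must execute them
    barrier(CLK_LOCAL_MEM_FENCE);
}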

I'll use an implementation like this for the copy:

// each work item writes every 16th pixel (stride = work-group size)
for (int i = get_local_id(0); i < TILE_W * TILE_H; i += get_local_size(0)) {
    write_imageui(colorBufImg,
                  (int2)(tileLeft + i % TILE_W, tileTop + i / TILE_W),
                  localColorBuf[i]);
}

If you know it's going to be 16x16 and you have 16 work items, just hardcode the size:

int lx = get_local_id(0);             // column handled by this work item
for (int y = 0; y < TILE_H; y++) {
    write_imageui(colorBufImg,
                  (int2)(tileLeft + lx, tileTop + y),
                  localColorBuf[lx + y * TILE_W]);
}

This'll let the compiler unroll the loop fully and pre-calculate most indices.

Originally posted by: binarysplit Oh, one more thing I was wondering about. You said you just read a 16x16 tile with a 16x16 work group size. Does that mean there's a way to separate the "process data" and "copy cached data from local back to global" phases into multiple kernels? Or do your "process data" kernels run at 16x16 as well?

 

Generally I'd use 16x16 work items as well, but it depends on the problem. Even if you're not using them all for the work, it can be more efficient to use the extra work items to do the memory operations, although it might not be so important for image accesses.

Data load and data process can only be separated by an 'if', since local memory isn't persistent across kernel invocations. Or by just changing the addressing for different parts of the code.

i.e. load the data using all of the threads, then just 'if' out the ones not involved in the calculation:

// load using all threads in the group
...
barrier(CLK_LOCAL_MEM_FENCE);
if (lx < 16) {
    // do the work with the first 16 threads only
}

Although if you're doing that, you don't want any more than 64 threads as the I/O work gang.

Still, if you're generating a 16x16 tile, that means each of the 16 threads has to generate 16 pixels?  Even if you save some of the calculations, you're throwing away a potential for 16x parallelism there (unless you run 4 of them as above).

Also, if you do run say 64 threads split into groups of 16 for the calculation, you can re-arrange the 64 threads however you like for the data load/store, e.g. as 4x4 lots of 8x8 blocks rather than 16 lots of a 64x1 line.
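i.e. the same 64 work items get two different 'shapes', one for the memory ops and one for the calculation - again just a sketch with made-up names, assuming a 1D local size of 64 and one 32x32 region per group:

__kernel void two_shapes(read_only image2d_t src, write_only image2d_t dst)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP_TO_EDGE |
                          CLK_FILTER_NEAREST;
    __local uint4 region[32][32];      // 2x2 tiles of 16x16

    int lid = get_local_id(0);         // 0..63
    int rx  = get_group_id(0) * 32;    // origin of this group's 32x32 region
    int ry  = get_group_id(1) * 32;

    // load shape: an 8x8 block of threads sweeps the region in 4x4 steps
    int mx = lid % 8, my = lid / 8;
    for (int by = 0; by < 4; by++)
        for (int bx = 0; bx < 4; bx++)
            region[by * 8 + my][bx * 8 + mx] =
                read_imageui(src, smp, (int2)(rx + bx * 8 + mx, ry + by * 8 + my));

    barrier(CLK_LOCAL_MEM_FENCE);

    // calculation shape: four 16-item gangs, one 16x16 sub-tile each
    int gang = lid / 16, lane = lid % 16;
    /* ... each gang works on its quarter of region[][] using lane,
           then the store can use the 8x8 shape again ... */
}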

0 Likes

Ah, that makes a lot of sense. I didn't know that APP combines scalar operations into vector operations so that effectively 4 work items run on each of the 16 stream processors, making 64 work items per group the optimum. I'll adapt my algorithm to run with 64 work items per tile, as the local memory isn't large enough for me to do multiple tiles per CU.

Thanks for your help.

0 Likes

Originally posted by: binarysplit Ah, that makes a lot of sense. I didn't know that APP combines scalar operations into vector operations so that effectively 4 work items run on each of the 16 stream processors, making 64 work items per group the optimum. I'll adapt my algorithm to run with 64 work items per tile, as the local memory isn't large enough for me to do multiple tiles per CU.

Thanks for your help.



Well not to confuse things more ... 🙂

No, the VLIW stuff is separate (and in addition). The VLIW stuff is exposed by the ISA (so the hardware 'thread' has all 4/5 slots to itself), but for OpenCL it's an implementation detail of the compiler and not something you can really affect directly. Even the 'vector' operations are re-arranged across VLIW instruction boundaries (there's no need to enforce SIMD).

My understanding (without looking it up) is that each unit is a 16-way processor but there's a 4-cycle pipeline delay (or something to that effect), so you need 16x4 threads to fully utilise it, otherwise pipeline slots go to waste.

Not that it really matters what the details are: just use a multiple of 64 for the work items when you can and it works pretty well on all GPU hardware.
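On the host side that just means asking for a local work size that's a multiple of 64, e.g. 8x8 - a sketch, assuming the queue, kernel and image dimensions are already set up and the global size divides evenly:

#include <CL/cl.h>

cl_int launch_with_64(cl_command_queue queue, cl_kernel kernel,
                      size_t width, size_t height)
{
    size_t global[2] = { width, height };   /* assumed to be multiples of 8 */
    size_t local[2]  = { 8, 8 };            /* 64 work items per group */
    return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                  global, local, 0, NULL, NULL);
}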

 

0 Likes