
toddwbrownjr
Journeyman III

OpenCL Coalescing To Global Memory

Hello all,

I have an HD 5870 and the ATI Stream V2.0 SDK installed. I have a question about coalescing global memory reads/writes in a kernel. The documentation says a wavefront is composed of 64 work-items, and it appears to suggest that 32 work-items are processed at one time. If, in a given half-wavefront, the addresses of the global accesses are not aligned and/or not completely sequential across increasing work-item IDs, will the hardware make 32 individual global accesses (horrible bandwidth), or will it make as few coalesced global reads as necessary to fulfill the half-wavefront's request (better bandwidth)?
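For illustration, here is a minimal pair of kernel sketches contrasting the two access patterns in question; the kernel and buffer names are made up and this is not code from the SDK or the original post:

/* Hypothetical kernels for illustration only. Each work-item reads
   one float from global memory. */

__kernel void sequential_read(__global const float *in,
                              __global float *out)
{
    size_t gid = get_global_id(0);
    /* Work-item i reads element i: addresses increase sequentially
       across the wavefront -- the friendly case. */
    out[gid] = in[gid];
}

__kernel void scattered_read(__global const float *in,
                             __global const int *idx,
                             __global float *out)
{
    size_t gid = get_global_id(0);
    /* Work-item i reads wherever idx[i] points: addresses may be
       unaligned and non-sequential -- the case the question asks about. */
    out[gid] = in[idx[gid]];
}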

20 Replies
Fr4nz
Journeyman III

OpenCL Coalescing To Global Memory

Originally posted by: toddwbrownjr [...]

You have coalesced reads/writes when the threads of a half-wavefront cooperatively read/write sequential data, or data aligned on blocks of 128 bits. If these conditions aren't met, then you don't have coalesced reads/writes.

 

To learn more, watch these:

http://www.macresearch.org/opencl_episode4

http://www.macresearch.org/opencl_episode5

toddwbrownjr
Journeyman III

OpenCL Coalescing To Global Memory

All,

I understand that to completely coalesce the work-items in a half-wavefront, the base address has to be aligned to a 128-byte boundary and the requested addresses of the work-items have to increase sequentially (with no "holes"). However, my question is how the hardware reacts when this is not the case. For example, consider a half-wavefront with the following mapping from work-item ID to float memory address:

ID 0: address 0

ID 1: address 1 (byte address 4, DWord address 1, so it's sequential)

...

ID 30: address 30 (byte address 120, DWord address 30, so it's sequential)

ID 31: address 35 (not sequential from ID 30)

The hardware could handle this in several ways: 1) two reads from memory (one 'coalesced' read covering addresses 0-31, throwing away the value at address 31, and one 'non-coalesced' read to get the value at address 35), or 2) 32 individual 'non-coalesced' reads from memory (because the offset and sequential requirements are not met). I ask this because NVIDIA used to handle this case with 32 individual 'non-coalesced' memory requests (horrible bandwidth), but now makes as few 'coalesced' requests as necessary to service the half-wavefront, which in the above example would be 2. Does anyone know how the ATI driver/hardware handles this case? I might be able to determine this with the profiler, but I am a noob with ATI, so I am not that far yet.
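For concreteness, the access pattern above could come from a kernel along these lines; this is only an illustrative sketch with made-up kernel and buffer names, not code from the original post:

__kernel void almost_sequential(__global const float *in,
                                __global float *out)
{
    size_t gid = get_global_id(0);
    /* Work-items 0..30 read elements 0..30 (sequential addresses);
       work-item 31 jumps to element 35, breaking the sequential
       pattern for the half-wavefront. */
    size_t src = (gid == 31) ? 35 : gid;
    out[gid] = in[src];
}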

Thanks 

 

MicahVillmow
Staff

OpenCL Coalescing To Global Memory

toddwbrownjr,
Please see the hardware overview to get a better idea of how our wavefronts are executed on the hardware. The hardware itself does not do load coalescing between threads. If the reads are sequential, then you will hit the cache/memory lines in a friendly manner and can possibly achieve peak performance, but if the reads are random, then you will not achieve peak. In both cases, the same number of read instructions are executed.
http://developer.amd.com/gpu/A...ages/Publications.aspx
hazeman
Adept II

OpenCL Coalescing To Global Memory

First of all, "half-warp" is a term used for NVIDIA's hardware. On Radeons it doesn't work that way. If I remember correctly, the wavefront is issued over 4 cycles (in groups of 16), but from what I know this isn't as important as the half-warp is for NVIDIA's hardware.

Second, on ATI's hardware you should use "type4" (so float4, int4) for reading/writing. When you use plain "type" you get roughly performance/4 (at least for memory reads; I haven't tested this for writes).
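A minimal sketch of the vectorized read being recommended here (kernel and buffer names are made up):

__kernel void copy_scalar(__global const float *in,
                          __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid];   /* one 32-bit load per work-item */
}

__kernel void copy_vec4(__global const float4 *in,
                        __global float4 *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid];   /* one 128-bit load per work-item */
}

The float4 version moves four times as much data per work-item with the same number of fetch instructions.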

And any answer on the topic of coalescing/cache size/cache lines should be taken with a big grain of salt. ATI has a long history of not giving any info on cache behaviour (or giving conflicting info).

Micah wrote that only being cache-friendly is important.

On 4xxx hardware the cache line is "128 bytes", so 8 threads (using float4) read a whole cache line. This would imply that 8 threads reading contiguous memory give full speed.

But tests show that it isn't true.

You achieve maximum speed when the full wavefront reads contiguous memory. There is a slow degradation for a 32/32 split (32 threads reading contiguously, the next 32 contiguous but starting from another address). A 16/16/16/16 split gives a further slight reduction in speed.

But an 8/8/8/8/8/8/8/8 split is significantly slower than a full-wavefront read.

Of course, for 5xxx the results should differ. But it's impossible to verify what Micah said, as there is NO OFFICIAL INFO about the cache size/architecture/cache line size on 5xxx cards.
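One way to set up the split patterns described above is to remap indices into contiguous runs. This is only a sketch of the idea (kernel and parameter names are invented), not the actual test code:

/* block = work-items per contiguous run: 64 reproduces the full-wavefront
   read, 32 the 32/32 split, 16 the 16/16/16/16 split, 8 the 8/8/.../8 split.
   stride > block makes successive runs start at unrelated addresses. */
__kernel void split_read(__global const float4 *in,
                         __global float4 *out,
                         const uint block,
                         const uint stride)
{
    size_t gid  = get_global_id(0);
    size_t run  = gid / block;   /* which contiguous run this work-item is in */
    size_t lane = gid % block;   /* position inside that run */
    out[gid] = in[run * stride + lane];
}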

 

ryta1203
Journeyman III

OpenCL Coalescing To Global Memory

If OpenCL only uses global memory and the global memory is not cached, then why is the cache important in OpenCL?

MicahVillmow
Staff

OpenCL Coalescing To Global Memory

Ryta,
Currently global memory is not cached, but that is not always going to be the case. The reason being cache-friendly matters is that when you read data, it is not just your requested data that gets read in; the whole cache line gets read, which is a 4x2 block of memory. If a neighboring thread uses any of the data from that read, it gets delivered to that thread without requiring another trip out to memory. This is why being cache-friendly is important.
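As an illustration of that reuse (hypothetical kernel and buffer names), a stencil-style read where neighbouring work-items touch overlapping elements:

__kernel void stencil3(__global const float *in,
                       __global float *out,
                       const uint n)
{
    size_t gid = get_global_id(0);
    if (gid == 0 || gid + 1 >= n)
        return;
    /* Work-item i and work-item i+1 both read in[i] and in[i+1]; if the
       first access pulled in the whole cache line, the neighbour's
       overlapping reads can be served from cache instead of requiring
       another trip out to memory. */
    out[gid] = in[gid - 1] + in[gid] + in[gid + 1];
}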
hazeman
Adept II

OpenCL Coalescing To Global Memory

Originally posted by: ryta1203 If OpenCL only uses global memory and the global memory is not cached, then why is the cache important in OpenCL?


OpenCL uses UAVs (not global) to access memory. On 5xxx it's translated to a VFETCH instruction, which, if I'm not mistaken, uses the vertex cache (which has been merged with the texture cache?).

On 4xxx a UAV is translated into a standard global access (without cache).

 

MicahVillmow
Staff

OpenCL Coalescing To Global Memory

This is what the UAV read can translate into on the 5XXX.
06 TEX: ADDR(64) CNT(1)
12 VFETCH R0.x___, R0.w, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)

Which is a vertex fetch that goes through the texture unit. The uncached bit itself is not set, so this is a cached read because the compiler can determine that the read and write do not overlap. However, there are situations where the uncached bit will be set and the read will be uncached.
Here is an example of an uncached read through texture.
137 TEX: ADDR(2384) CNT(1)
287 RD_SCRATCH R24._y__, VEC_PTR[6], ARRAY_SIZE(8) ELEM_SIZE(3) UNCACHED
ryta1203
Journeyman III

OpenCL Coalescing To Global Memory

So two things here:

1. Future generations of ATI GPUs will have cached global memory.

2. Currently, the OpenCL implementation does not always use global memory? Instead it uses "UAVs", which can use the vertex cache through the texture units?

Is this information in the documentation?
