Hello all,
I have an HD 5870 and the ATI Stream V2.0 SDK installed. I had a question regarding coalescing global memory reads/writes in a kernel. The documentation says a wavefront is composed of 64 work items, and it appears to suggest that 32 work items are processed at one time. If, in a given half-wavefront, the addresses of the global items are not aligned and/or not completely sequential across increasing work-item IDs, will the hardware make 32 individual global accesses (horrible bandwidth) or will it try to make as few coalesced global reads as necessary to fulfill the half-wavefront request (better bandwidth)?
Originally posted by: toddwbrownjr ... will the hardware make 32 individual global accesses (horrible bandwidth) or will it try to make as few coalesced global reads as necessary to fulfill the half-wavefront request (better bandwidth)?
You get coalesced reads/writes when the threads of a half-wavefront cooperatively read/write sequential data, or data aligned on 128-bit blocks. If these conditions aren't met, the reads/writes are not coalesced.
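For concreteness, here is a minimal kernel sketch of the two cases (kernel and buffer names are made up purely for illustration):

// Coalesced: work item i reads element i, so a half-wavefront touches one
// contiguous, aligned 128-byte block (32 work items x 4-byte float).
__kernel void copy_coalesced(__global const float* in, __global float* out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid];
}

// Not coalesced: the stride scatters neighbouring work items' addresses,
// so they are neither sequential nor aligned to a common block.
__kernel void copy_strided(__global const float* in, __global float* out,
                           int stride)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid * stride];
}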
To learn more, watch these:
http://www.macresearch.org/opencl_episode4
http://www.macresearch.org/opencl_episode5
All,
I understand that to completely coalesce work items in a half-wavefront, the base address has to be aligned to a 128-byte boundary and the requested addresses of the work items have to increase sequentially (with no "holes"). However, my question is how the hardware reacts when this is not the case. For example, consider a half-wavefront with the following mapping from work-item ID to float memory address:
ID 0: address 0
ID 1: address 1 (byte address 4, DWord address 1, so it's sequential)
.....
ID 30: address 30 (byte address 120, DWord address 30, so it's sequential)
ID 31: address 35 (not sequential from ID 30)
The hardware could handle this several ways. 1) Two reads from memory: one 'coalesced' read covering addresses 0-31 (throwing away the value at 31) and one 'non-coalesced' read to fetch the value at address 35. 2) 32 individual 'non-coalesced' reads from memory (because the offset and sequential requirements are not met). I ask because NVIDIA hardware used to handle this case with 32 individual 'non-coalesced' memory requests (horrible bandwidth), but now issues as few 'coalesced' requests as necessary to service the half-warp, which in the above example would be two. Does anyone know how the ATI driver/hardware handles this case? I might be able to determine this with the profiler, but I am a noob with ATI, so I am not that far yet.
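For reference, a sketch of a kernel that would generate exactly the access pattern above (the kernel name and the modulo-32 lane computation are my own, purely for illustration):

// Lanes 0..30 of each half-wavefront read sequentially; lane 31 skips
// ahead by 4 elements (ID 31 -> address 35), breaking the sequential
// requirement for the last work item only.
__kernel void almost_coalesced(__global const float* in, __global float* out)
{
    size_t gid  = get_global_id(0);
    size_t lane = gid % 32;                     // position in the half-wavefront
    size_t idx  = (lane == 31) ? gid + 4 : gid; // ID 31 reads address 35
    out[gid] = in[idx];
}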
Thanks
First of all, "half-warp" is a term from NVIDIA's hardware; Radeons don't work that way. If I remember correctly, the wavefront is issued over 4 cycles (in groups of 16), but from what I know that grouping isn't as important as the half-warp is on NVIDIA's hardware.
Second, on ATI hardware you should use "type4" (so float4, int4) for reading/writing. When you use plain "type" you get roughly a quarter of the performance (at least for memory reads; I haven't tested this for writes).
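A sketch of the difference (trivial scaling kernels, names made up; the float4 version assumes the buffer is laid out as vec4s):

// Scalar version: each work item fetches 4 bytes per read instruction.
__kernel void scale_f1(__global const float* in, __global float* out, float k)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * k;
}

// Vector version: each work item fetches 16 bytes per read instruction,
// which is what reportedly gets full bandwidth on ATI hardware.
__kernel void scale_f4(__global const float4* in, __global float4* out, float k)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * k;   // float4 * float scales component-wise in OpenCL C
}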
And any answer on the topic of coalescing/cache size/cache lines should be taken with a big grain of salt. ATI has a long history of not giving any info on cache behaviour (or giving conflicting info).
Micah wrote that only being cache-friendly is important.
On 4xxx hardware the cache line is "128 bytes", so 8 threads (using float4) read a whole cache line. This would imply that 8 threads reading contiguous memory give full speed.
But tests show that it isn't true.
You achieve maximum speed when the full wavefront reads contiguous memory. There is a slight degradation for a 32/32 split (32 threads reading contiguously, the next 32 contiguous but starting from a different address). A 16/16/16/16 split gives a further slight reduction in speed.
But an 8/8/8/8/8/8/8/8 split is significantly slower than a full-wavefront read.
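A sketch of the kind of split pattern I mean (the parameterisation is my own; chunk = 64 gives the full-wavefront contiguous case, chunk = 8 the slow 8/8/.../8 split):

// Each group of 'chunk' work items reads a contiguous run of float4s,
// but successive chunks start at unrelated base addresses whenever
// chunk_stride > chunk.
__kernel void split_read(__global const float4* in, __global float4* out,
                         int chunk, int chunk_stride)
{
    size_t gid   = get_global_id(0);
    size_t block = gid / chunk;                  // which chunk this work item is in
    size_t lane  = gid % chunk;                  // position inside the chunk
    out[gid] = in[block * chunk_stride + lane];  // chunks start at scattered bases
}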
Of course, the results should differ for 5xxx. But it's impossible to verify what Micah said, as there is NO OFFICIAL INFO about cache size/architecture/cache-line size on 5xxx cards.
If OpenCL only uses global memory and the global memory is not cached, then why is the cache important in OpenCL?
Originally posted by: ryta1203 If OpenCL only uses global memory and the global memory is not cached, then why is the cache important in OpenCL?
OpenCL uses UAVs (not global) to access memory. On 5xxx this is translated to a VFETCH instruction, which, if I'm not mistaken, uses the vertex cache (which has been merged with the texture cache?).
On 4xxx a UAV is translated into a standard global access (without cache).
So two things here:
1. Future generations of ATI GPUs will have cached global memory.
2. Currently, the OpenCL implementation does not always use global memory? Instead it uses "UAVs", which can use the vertex cache through the texture units?
Is this information in the documentation?