Hello all,
I have an HD 5870 and the ATI Stream V2.0 SDK installed. I had a question regarding coalescing global memory reads/writes in a kernel. The documentation says a wavefront is composed of 64 work items, and it appears to suggest that 32 work items are processed at one time. If, in a given half-wavefront, the addresses of the global accesses are not aligned and/or not completely sequential across increasing wavefront IDs, will the hardware make 32 individual global accesses (horrible bandwidth), or will it try to make as few coalesced global reads as necessary to fulfill the half-wavefront request (better bandwidth)?
Originally posted by: toddwbrownjr Hello all,
I have an HD 5870 and the ATI Stream V2.0 SDK installed. I had a question regarding coalescing global memory reads/writes in a kernel. The documentation says a wavefront is composed of 64 work items, and it appears to suggest that 32 work items are processed at one time. If, in a given half-wavefront, the addresses of the global accesses are not aligned and/or not completely sequential across increasing wavefront IDs, will the hardware make 32 individual global accesses (horrible bandwidth), or will it try to make as few coalesced global reads as necessary to fulfill the half-wavefront request (better bandwidth)?
You get coalesced reads/writes when the threads of a half-wavefront cooperatively read/write sequential data, or data aligned on 128-bit blocks. If these conditions aren't met, you don't get coalesced reads/writes.
To learn more, watch these:
http://www.macresearch.org/opencl_episode4
http://www.macresearch.org/opencl_episode5
All,
I understand that to completely coalesce work items in a half-wavefront, the base address has to be aligned on a 128-byte boundary and the requested addresses of the work items have to increase sequentially (with no "holes"). However, my question is how the hardware reacts when this is not the case. For example, consider a half-wavefront with the following mapping from wavefront ID to float memory address:
ID 0 address 0
ID 1 address 1 (byte address 4, DWord address 1, so it's sequential)
.....
ID 30 address 30 (byte address 120, DWord address 30, so it's sequential)
ID 31 address 35 (not sequential from ID 30)
The hardware can handle this several ways: 1) 2 reads from memory (one 'coalesced' read covering addresses 0-31, throwing away the unused value at 31, and one 'non-coalesced' read to get the value at address 35); or 2) 32 individual 'non-coalesced' reads from memory (because the offset and sequential requirements are not met). I ask this since NVIDIA used to handle this case with 32 individual 'non-coalesced' memory requests (horrible bandwidth), but now makes as few 'coalesced' requests as needed to service the half-wavefront request, which in the above example would be 2. Does anyone know how the ATI driver/hardware handles this case? I might be able to determine this with the profiler, but I am a noob with ATI, so I am not that far yet.
Thanks
First of all, "half-warp" is a term used for NVIDIA's hardware. On Radeons it doesn't work that way. If I remember correctly, the wavefront is issued over 4 cycles (in groups of 16), but from what I know it isn't as important as the half-warp is on NVIDIA's hardware.
Second, on ATI's hardware you should use "type4" (so float4, int4) for reading/writing. When you use plain "type" you get performance/4 (at least for memory reads; I haven't tested this for writes).
And any answer on the topic of coalescing/cache size/cache lines should be taken with a big grain of salt. ATI has a long history of not giving any info on cache behaviour (or of giving conflicting info).
Micah wrote that only being friendly to the cache is important.
On 4xxx hardware the cache line is "128 bytes", so 8 threads (using float4) read a whole cache line. This would imply that 8 threads reading continuous memory give full speed.
But my tests show that isn't true.
You achieve maximum speed when a full wavefront reads continuous memory. There is a slow degradation for a 32/32 split (32 threads reading continuously, the next 32 continuously but starting from another address). A 16/16/16/16 split gives the next slight reduction in speed.
But 8/8/8/8/8/8/8/8 is significantly slower than a full-wavefront read.
Of course the results should differ for 5xxx. But it's impossible to verify what Micah said, as there is NO OFFICIAL INFO about cache size/architecture/cache line size on 5xxx cards.
If OpenCL only uses global memory and the global memory is not cached, then why is the cache important in OpenCL?
Originally posted by: ryta1203 If OpenCL only uses global memory and the global memory is not cached, then why is the cache important in OpenCL?
OpenCL uses UAVs (not global buffers) to access memory. On 5xxx this is translated to a VFETCH instruction, which, if I'm not mistaken, uses the vertex cache (which has been merged with the texture cache?).
On 4xxx a UAV is translated into a standard global access (without cache).
Originally posted by: hazeman
You achieve maximum speed when a full wavefront reads continuous memory.
Speaking about wavefronts reading memory, is there a fixed mapping necessary for fast reading? I mean, the first thread reads the first 4 bytes, the second the next, and so on. If so, how can the thread order be defined? Which thread should read first (second, third, ...) in the case of a 2- or 3-dimensional workgroup?
Speaking about wavefronts reading memory, is there a fixed mapping necessary for fast reading? I mean, the first thread reads the first 4 bytes, the second the next, and so on. If so, how can the thread order be defined? Which thread should read first (second, third, ...) in the case of a 2- or 3-dimensional workgroup?
Work-groups are always divided into linear wavefronts, e.g. if your group is 32 wide, two rows of the group will form a wavefront.
Originally posted by: gaurav.garg Speaking about wavefronts reading memory, is there a fixed mapping necessary for fast reading? I mean, the first thread reads the first 4 bytes, the second the next, and so on. If so, how can the thread order be defined? Which thread should read first (second, third, ...) in the case of a 2- or 3-dimensional workgroup?
Work-groups are always divided into linear wavefronts, e.g. if your group is 32 wide, two rows of the group will form a wavefront.
Yes, it is true. But what I want to know is how exactly this division is performed. If I have a workgroup like this
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8
2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8
3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8
4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8
5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8
6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8
7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8
8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8
which of these work-items form a wavefront?
x.y denotes local ID (x, y). The items from 1.1 to 8.8 will form a single wavefront, with 1.1 being the first thread, 1.2 the second thread, and 2.1 the 9th thread.
Thanks, gaurav.garg
So two things here:
1. Future generations of ATI GPUs will have cached global memory.
2. Currently, the OpenCL implementation does not always use global memory, but instead uses "UAVs", which can use the vertex cache through the texture units?
Is this information in the documentation?
Originally posted by: MicahVillmow This is what the UAV read can translate into on the 5XXX:
06 TEX: ADDR(64) CNT(1)
12 VFETCH R0.x___, R0.w, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
Which is a vertex fetch that goes through the texture unit. The uncached bit itself is not set, so this is a cached read, because the compiler can determine that the read and write do not overlap. However, there are situations where the uncached bit will be set and the read will be uncached. Here is an example of an uncached read through texture:
137 TEX: ADDR(2384) CNT(1)
287 RD_SCRATCH R24._y__, VEC_PTR[6], ARRAY_SIZE(8) ELEM_SIZE(3) UNCACHED
Micah,
I modified the MMM SDK code so that the compiler cannot be sure the memory reads/writes don't overlap. The generated ISA contains the same VFETCH code under a TEX clause. IIRC, we need to change the TEX clause to TEX_TC to make fetches cached. Right?
I see only an additional WAIT_ACK instruction. That just makes sure write acks are complete and should have nothing to do with cached reads.
#define TILEX 4
#define TILEX_SHIFT 2
#define TILEY 4
#define TILEY_SHIFT 2

__kernel void mmmKernel_local(__global float4 *matrixA,
                              __global float4 *matrixB,
                              __global float4 *matrixC,
                              int widthA,
                              __local float4 *blockA)
{
    int blockPos = get_local_id(0) + get_local_size(0) * (get_local_id(1) << TILEY_SHIFT); // Should be: localId * (TILEX / 4) (float4)

    /* Position of thread will be according to the number of values it writes, i.e. TILE size */
    int globalPos = get_global_id(0) + (get_global_id(1) << TILEY_SHIFT) * get_global_size(0);

    /* Each thread writes 4 float4s */
    float4 sum0 = (float4)(0);
    float4 sum1 = (float4)(0);
    float4 sum2 = (float4)(0);
    float4 sum3 = (float4)(0);

    int temp = widthA / 4;

    /* This loop runs for the number of blocks of A in the horizontal direction */
    for(int i = 0; i < (temp / get_local_size(0)); i++)
    {
        /* Calculate global ids of threads from the particular block to load from matrix A depending on i */
        int globalPosA = i * get_local_size(0) + get_local_id(0) + (get_global_id(1) << TILEY_SHIFT) * temp;

        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

        /* Load values into blockA from matrixA */
        blockA[blockPos] = matrixA[globalPosA];
        blockA[blockPos + get_local_size(0)] = matrixA[globalPosA + temp];
        blockA[blockPos + 2 * get_local_size(0)] = matrixA[globalPosA + 2 * temp];
        blockA[blockPos + 3 * get_local_size(0)] = matrixA[globalPosA + 3 * temp];

        barrier(CLK_LOCAL_MEM_FENCE);

        /* Calculate global ids of threads from the particular block to load from matrix B depending on i */
        int globalPosB = get_global_id(0) + ((i * get_local_size(0)) << TILEY_SHIFT) * get_global_size(0);

        /* This loop runs for the number of threads in the horizontal direction in the block of A */
        for(int j = 0; j < get_local_size(0) * 4; j = j + 4)
        {
            /* Load 4 float4s from blockA: access pattern = strided from local memory */
            float4 tempA0 = blockA[(j >> 2) + get_local_id(1) * TILEY * get_local_size(0)];
            float4 tempA1 = blockA[(j >> 2) + (get_local_id(1) * TILEY + 1) * get_local_size(0)];
            float4 tempA2 = blockA[(j >> 2) + (get_local_id(1) * TILEY + 2) * get_local_size(0)];
            float4 tempA3 = blockA[(j >> 2) + (get_local_id(1) * TILEY + 3) * get_local_size(0)];

            /* Load corresponding values from matrixB, access pattern = linear from global memory */
            float4 tempB0 = matrixB[globalPosB + j * get_global_size(0)]; // Should be localId.x * (TILEX / 4)
            float4 tempB1 = matrixB[globalPosB + (j + 1) * get_global_size(0)];
            float4 tempB2 = matrixB[globalPosB + (j + 2) * get_global_size(0)];
            float4 tempB3 = matrixB[globalPosB + (j + 3) * get_global_size(0)];

            sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x;
            sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y;
            sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z;
            sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w;
            sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x;
            sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y;
            sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z;
            sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w;
            sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x;
            sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y;
            sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z;
            sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w;
            sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x;
            sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y;
            sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z;
            sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w;
        }

        /* Write 16 values back to matrixA (instead of matrixC), so the
           compiler cannot prove the reads and writes don't overlap */
        matrixA[globalPos] = sum0;
        matrixA[globalPos + get_global_size(0)] = sum1;
        matrixA[globalPos + 2 * get_global_size(0)] = sum2;
        matrixA[globalPos + 3 * get_global_size(0)] = sum3;
    }
}
Originally posted by: MicahVillmow Gaurav, it is the instruction itself that determines the caching properties, not the clause type. The CF instruction itself just describes which cache to use, TEX/VTX_TC use the texture cache and VTX uses the vertex cache, not whether the data is read into the cache or not. Please refer to the ISA spec for a description of the various instruction types. For vfetch instructions, the mega_fetch field specifies the number of bytes that are read into the cache from memory, whereas for scratch/global it is the CACHED/UNCACHED bit.
So if the global memory is linear and the cache is not, how is the translation done? Does it take full advantage of the cache?
Originally posted by: ryta1203 Originally posted by: MicahVillmow Gaurav, it is the instruction itself that determines the caching properties, not the clause type. The CF instruction itself just describes which cache to use, TEX/VTX_TC use the texture cache and VTX uses the vertex cache, not whether the data is read into the cache or not. Please refer to the ISA spec for a description of the various instruction types. For vfetch instructions, the mega_fetch field specifies the number of bytes that are read into the cache from memory, whereas for scratch/global it is the CACHED/UNCACHED bit.
So if the global memory is linear and the cache is not, how is the translation done? Does it take full advantage of the cache?
Anyone?