I'm working on a program which is designed to do square and multiply operations on high precision fixed-point numbers. Each big number consists of a 24 bit whole part and a configurable 384*n-24 bit fractional part. Generally there is no way to keep the whole number in registers/LDS, so all the numbers are stored in global memory. The result of the square/multiply operation is calculated on blocks to reduce memory access. In the current implementation blocksize is 16x8.
Each block consists of the following operations:
- read 6 dwords starting from MSB, split it to 8 24bit numbers. (each limb is 24 bit, because of v_mad_i32_i24)
- read 6 dwords starting from somewhere inside the number, split it to 8 limbs and insert into a 24 element circular buffer (unrolled code, not indirectly indexed)
- do the calculations (accumulate 24bit*24bit products in a 2*32bit register pair -> v_mad24, v_madhi24, v_add, v_addc )
- adjust each accumulated column to 24bit (3 instructions: and, shr, add)
- go to next block
So basically it's 12*4=48 byte reads interleaved with 16*8*4+16*3=560 alu instructions(not counting the 32->24 bit split).
(Here's an illustration, how the square operation looks like (on 48x24bit limbs): http://x.pgy.hu/~worm/het/bigint_sqr.PNG )
Mem/Alu ratio is at least 0.085 bytes/alu_inst. I think it's totally ok for the 7970, if not, please tell, I'm not 100% sure of this.
My problem is related to how I partitioning data. I've found out that the program woks best when I launch only 1024 worker threads (only 16 wavefronts). If I use all the 2048, it will be much slower. Recently I found out in the opencl manual that it has a serious channel-conflict.
My data format was like: uint bigLimbs[NumberOfWorkerThreads, NumberOfBignums, LimbsPerBignum] (where NumberOfBignums is the available big 'registers' for each worker-threads, for example 5 bignums are enough for a Mandelbrot calculation, NumberOfWorkerThreads: this could be 6K at maximum so every CU can have 3x wavefronts in queue, LimbsPerBignum can be 12, 24, 36, ...infinity based on the required precision)
I know this is wrong, and I'm planning to do 256 byte aligned reads in all the Compute Units. For this I will reorder the indices of the array like -> uint bigLimbs[NumberOfBignums, LimbsPerBignum,NumberOfWorkerThreads] this way I can do 64*4 byte 'burst' reads for all the 64 workitems of all the CUes. I hope this way I can get near to peak ALU and peak MEM bandwith.
But my real question is this (thank you if you've got here already ):
There are different kinds of memory read instructions based on element size, which would be the better for reading 6x dwords?
- using 6x tbuffer_load_format_x (each of them reading 256bytes aligned, stride between threads is 4)
- using 2x tbuffer_load_format_xyz (each of them reads 768bytes, aligned to 256byte boundaries, stride between threads is 12)
The second one seems like more optimal, but what would be the performance difference if I rearrange the bigLimbs array to this:
uint bigLimbs[NumberOfBignums, LimbsPerBignum/3,NumberOfWorkerThreads,3] (3 is for format_xyz)
Using 2x tbuffer_load_format_xyz instead of 6x tbuffer_load_format_x is only wastes 1% alu performance, but what do you think, is it worth to use the bigger 768byte reads, with that complicated memory rearrangement (grouping every 3*32bits to use format xyz)?
Thank You in advance.