I'm working on a program that does square and multiply operations on high-precision fixed-point numbers. Each big number consists of a 24-bit whole part and a configurable 384*n-24-bit fractional part. Generally there is no way to keep a whole number in registers/LDS, so all the numbers are stored in global memory. The result of the square/multiply operation is calculated in blocks to reduce memory access. In the current implementation the block size is 16x8.
Each block consists of the following operations:
- read 6 dwords starting from the MSB and split them into 8 24-bit numbers (each limb is 24 bit because of v_mad_i32_i24)
- read 6 dwords starting from somewhere inside the number, split them into 8 limbs and insert them into a 24-element circular buffer (unrolled code, not indirectly indexed)
- do the calculations (accumulate the 24bit*24bit products in a 2*32-bit register pair -> v_mad24, v_madhi24, v_add, v_addc)
- adjust each accumulated column to 24 bits (3 instructions: and, shr, add)
- go to next block
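The dword-to-limb split in the first two steps can be modelled on the host side like this (a minimal C sketch; the function name and the exact bit ordering are my assumptions, not taken from the kernel):

```c
#include <stdint.h>

/* Hedged host-side model of the dword-to-limb split (names are mine, not
 * from the kernel). Six 32-bit dwords hold exactly eight 24-bit limbs,
 * since 6*32 == 8*24 == 192 bits. dw[0] is the most significant dword
 * and limb[0] the most significant limb. */
static void split_dwords_to_limbs(const uint32_t dw[6], uint32_t limb[8])
{
    uint64_t acc = 0;  /* bit accumulator; valid data sits in the low `bits` bits */
    int bits = 0;      /* number of unconsumed bits in acc */
    int d = 0;

    for (int l = 0; l < 8; l++) {
        if (bits < 24) {                 /* refill from the next dword */
            acc = (acc << 32) | dw[d++];
            bits += 32;
        }
        bits -= 24;
        limb[l] = (uint32_t)((acc >> bits) & 0xFFFFFF);
    }
}
```

After the eighth limb all six dwords are consumed exactly, so no bits are lost or read twice.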
So basically it's 12*4 = 48 bytes of reads interleaved with 16*8*4 + 16*3 = 560 ALU instructions (not counting the 32->24-bit split).
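The column-adjustment step in the list above (the and/shr/add triple) amounts to carrying everything above bit 24 into the next more significant column; a hedged C model, assuming col[0] is the least significant column and names of my own choosing:

```c
#include <stdint.h>

/* Hedged model of the per-column adjustment: each column has accumulated
 * 24x24-bit partial products in a wide (2*32-bit) value; keep the low
 * 24 bits as the limb and add everything above them as a carry into the
 * next more significant column. */
static void normalize_columns(uint64_t col[], int n)
{
    for (int i = 0; i < n - 1; i++) {
        uint64_t carry = col[i] >> 24;  /* shr: the bits above the limb */
        col[i] &= 0xFFFFFF;             /* and: keep the 24-bit limb */
        col[i + 1] += carry;            /* add: carry into the next column */
    }
}
```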
(Here's an illustration of how the square operation looks (on 48x24-bit limbs): http://x.pgy.hu/~worm/het/bigint_sqr.PNG )
The mem/ALU ratio is at least 0.085 bytes/ALU instruction. I think that's totally OK for the 7970; if not, please tell me, I'm not 100% sure about this.
My problem is related to how I partition the data. I've found that the program works best when I launch only 1024 worker threads (only 16 wavefronts). If I use all 2048, it gets much slower. Recently I found in the OpenCL manual that this layout causes serious channel conflicts.
My data format was: uint bigLimbs[NumberOfWorkerThreads, NumberOfBignums, LimbsPerBignum] (where NumberOfBignums is the number of big 'registers' available to each worker thread; for example, 5 bignums are enough for a Mandelbrot calculation. NumberOfWorkerThreads can be at most 6K, so every CU can have 3 wavefronts in queue. LimbsPerBignum can be 12, 24, 36, ...infinity, based on the required precision.)
I know this is wrong, and I'm planning to do 256-byte-aligned reads on all the compute units. For this I will reorder the indices of the array to uint bigLimbs[NumberOfBignums, LimbsPerBignum, NumberOfWorkerThreads]; this way I can do 64*4-byte 'burst' reads for all 64 work-items on all the CUs. I hope this way I can get near peak ALU and peak memory bandwidth.
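To make the coalescing concrete, here is a hypothetical sketch of the address math for both layouts (the constants and function names are illustrative, not from my code); in the reordered layout neighbouring threads read adjacent dwords, so the 64 work-items of a wavefront fetch one contiguous, 256-byte aligned block:

```c
#include <stddef.h>

/* Illustrative constants; the real values are configurable. */
enum { THREADS = 1024, BIGNUMS = 5, LIMBS = 24 };

/* old layout: bigLimbs[thread][bignum][limb] */
static size_t idx_old(size_t t, size_t n, size_t l)
{
    return (t * BIGNUMS + n) * LIMBS + l;
}

/* reordered layout: bigLimbs[bignum][limb][thread] */
static size_t idx_new(size_t t, size_t n, size_t l)
{
    return (n * LIMBS + l) * THREADS + t;
}
```

With idx_old, two neighbouring threads reading the same limb are BIGNUMS*LIMBS dwords apart; with idx_new they are exactly one dword (4 bytes) apart.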
But my real question is this (thank you if you've read this far :) ):
There are different kinds of memory read instructions based on element size; which would be better for reading 6 dwords?
- using 6x tbuffer_load_format_x (each of them reading 256 bytes, 256-byte aligned; stride between threads is 4)
- using 2x tbuffer_load_format_xyz (each of them reading 768 bytes, aligned to 256-byte boundaries; stride between threads is 12)
The second one seems more optimal, but what would the performance difference be if I rearranged the bigLimbs array to this:
uint bigLimbs[NumberOfBignums, LimbsPerBignum/3, NumberOfWorkerThreads, 3] (the 3 is for format_xyz)
Using 2x tbuffer_load_format_xyz instead of 6x tbuffer_load_format_x wastes only 1% of ALU performance, but what do you think: is it worth using the bigger 768-byte reads, with that complicated memory rearrangement (grouping every 3*32 bits to use format xyz)?
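The xyz-grouped layout above can be sketched the same way (again, the constants and names are illustrative assumptions): three consecutive limbs of one thread sit in adjacent dwords, so one tbuffer_load_format_xyz fetches all three, with a 12-byte stride between threads.

```c
#include <stddef.h>

/* Illustrative constants; LIMBS must be divisible by 3 for this layout. */
enum { THREADS = 1024, LIMBS = 24 };

/* bigLimbs[bignum][limbTriple][thread][component] */
static size_t idx_xyz(size_t t, size_t n, size_t l)
{
    size_t triple = l / 3;  /* which xyz group inside the bignum */
    size_t comp   = l % 3;  /* x, y or z component */
    return ((n * (LIMBS / 3) + triple) * THREADS + t) * 3 + comp;
}
```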
Thank you in advance.
Here's the CAL kernel: http://x.pgy.hu/~worm/het/fcu.elf (it's generated stuff, so not that meaningful)
And here's the inner loop I'm talking about: http://x.pgy.hu/~worm/het/fcu_inner_loop.txt
You can see the massive reads at the beginning followed by the MULs and ADDs.
Now I'm trying to reorganize memory so that each read instruction accesses a contiguous memory area. I hope this inner loop will reach max ALU utilization with 264 GB/s reads.
I've finally remade the thing, so every compute unit reaches memory in aligned, contiguous 256-byte blocks. Now the whole thing calculates 1.5x faster, and the estimated ALU utilization is 1/32.
The best performance comes when I use only 512 threads (on the 7970). 1024 threads are slower, and 2048 threads are even slower. There is no way to hide memory latencies this way.
The algorithm has an unusual memory access pattern: while processing a whole number there is one pointer that goes upward in memory and another that goes the opposite way. At both pointers I read a large contiguous chunk of 6*4*threadCount = 6*4*512 = 12 KB (at least 2048-byte aligned). Maybe the cache system goes crazy from this...
There is an s_waitcnt vmcnt(0) after the 12 KB reads (issued with 6x tbuffer_load_format_x), and finally a big computation block follows: 2*6-dword reads followed by 576+ v_ instructions (the ratio is at least 12 ops/byte).
I'm curious what happens when I bypass the cache O.o
Edit: Nothing happened
Does anyone have a clue? The task is still as simple as multiplying terribly big numbers, but the implementation on the 7970 is kind of frustrating.