Archives Discussions

jzuber · ‎10-03-2012

I am trying to increase throughput by using AMD GPGPU and OpenCL, and I am getting less performance than a CPU using a loop. Also take note: this is my first openCL kernel so I understand if it is complete garbage.

Essentially I get large amounts of data, and for each of those data objects, I need to run some conversion, then compare to a 512*7 sized look-up table, and do a sum of squares of each of those 7 values (I will illustrate below). Currently I want to see the gain I can get with a less powerful GPGPU (Radeon 6570), with the hopes that any gain at all given by this will only get better as the GPGPU I select gets better.

Here is my Kernel code:

#define LUT_0_1_TABLE_ROW_SIZE 512

#define LUT_2_TABLE_COL_SIZE 12

#define LUT_2_TABLE_ROW_SIZE 207

#define LUT_2_TABLE_SIZE LUT_2_TABLE_COL_SIZE*LUT_2_TABLE_ROW_SIZE

struct object

{

float m_fval0;

float m_fval1;

float m_fval2;

float m_fval3;

float m_fval4;

float m_fval5;

float m_fval6;

float m_fval7;

unsigned int m_uval0;

unsigned int m_uval1;

unsigned int m_uval2;

unsigned int m_uval3;

unsigned int m_uval4;

};

ushort getLut2TableIndex(float fin0, float fin1)

{

float fMin;

ushort tableIndex = (ushort)(fin1/7); // use tempIndex as a temp var.....Hacks!!

ushort minIdx = min(max(tableIndex, (ushort)0), (ushort)7);

tableIndex = (minIdx == 1 ? 2 : minIdx); // compensate for the fact that the "1" position doesn't exist

tableIndex *= 26;

tableIndex -= (minIdx < 2 ? 1 : 0); // compensation

fMin = (fin0- minIdx);

tableIndex += (ushort)fMin;

// Make sure the idx is in a valid range - otherwise we can crash!

tableIndex = min(max(minIdx, (ushort)0), (ushort)(LUT_1_TABLE_ROW_SIZE - 1)); // clamp returns the value inside the min and max

// not using clamp because of using openCL 1.0 -> final release will use openCL 1.1....

return tableIndex;

}

__kernel void convertPdwsAndPerformAoa(global uint8 * input0, global uint8 * input1, global uint8 * input2, global uint8 * input3,

global struct object * output, constant float * lut0, constant float4 * lut1,

constant float * lut2, float fin0, local float4 * lut1Local)

{

float ftemp0 = -9.3f;

float ftemp1 = 12.0f;

int itemp0 = 0;

float tempLog = 0.0f;

float4 f4temp0 = { 0.0f, 0.0f, 0.0f, 0.0f };

float4 f4temp1 = { 0.0f, 0.0f, 0.0f, 0.0f };

float8 f8temp0 = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };

float8 f8temp1 = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };

uint8 input = input3[gid];

struct object tempOut;

ushort lut2ColGroupIndex = 0;

ushort lut2RowIdx;

ushort lut0RowIdx = 0;

float sumOfSquares;

float minSum = MAXFLOAT;

uint gid = get_global_id(0);

uint lid = get_local_id(0);

uint nloc = get_local_size(0);

// Copy the lut1 global vector into a local vector

for (unsigned int i = lid; i < LUT_0_1_TABLE_ROW_SIZE; i += nloc)

{

lut1Local = lut1;

}

barrier(CLK_LOCAL_MEM_FENCE);

// Conversion code -> this is fast no need to look at!

// .................

// **************************

lut2RowIdx = getLut2TableIndex(tempOut.m_fval0, fin0);

f4temp0.s0 = f4temp0.s0/f4temp1.s0;

f4temp0.s1 = f4temp0.s1/f4temp1.s1 + lut2[lut2RowIdx + 3*lut2ColGroupIndex];

f4temp0.s2 = f4temp0.s2/f4temp1.s2 + lut2[lut2RowIdx + 3*lut2ColGroupIndex + 1];

f4temp0.s3 = f4temp0.s3/f4temp1.s3 + lut2[lut2RowIdx + 3*lut2ColGroupIndex + 2];

f8temp0.s0 = f4temp0.s0 - f4temp0.s1;

f8temp0.s1 = f4temp0.s0 - f4temp0.s2;

f8temp0.s2 = f4temp0.s0 - f4temp0.s3;

f8temp0.s3 = f4temp0.s1 - f4temp0.s2;

f8temp0.s4 = f4temp0.s2 - f4temp0.s3;

f8temp0.s5 = f4temp0.s1 - f4temp0.s3;

f8temp0.s6 = f8temp0.s3 - f8temp0.s4;

// Calculate the cost function

for (unsigned short idx = 0; idx < LUT_0_1_TABLE_ROW_SIZE; ++idx)

{

// Calculate the lut1 components

f8temp1.s0 = lut1Local[idx].s0*tempOut.m_fval0;

f8temp1.s1 = lut1Local[idx].s1*tempOut.m_fval0;

f8temp1.s2 = lut1Local[idx].s2*tempOut.m_fval0;

f8temp1.s3 = f8temp1.s0 - f8temp1.s1;

f8temp1.s4 = f8temp1.s0 - f8temp1.s2;

f8temp1.s5 = f8temp1.s0 - f8temp1.s3;

f8temp1.s6 = f8temp1.s3 - f8temp1.s4;

f8temp0 -= f8temp1;

f8temp1 += (f8temp1 > (float8)100.0f ? (float8)-5.0f : (float8)0.0f);

// Square all elements of the vector

f8temp1 *= f8temp1;

// Sum all elements of the vector together

sumOfSquares = f8temp1.s0;

sumOfSquares += f8temp1.s1;

sumOfSquares += f8temp1.s2;

sumOfSquares += f8temp1.s3;

sumOfSquares += f8temp1.s4;

sumOfSquares += f8temp1.s5;

sumOfSquares += f8temp1.s6;

// If the new sum is < the old min, then this is the new min! Update the idx and the minSum.

lut0RowIdx = (minSum < sumOfSquares ? lut0RowIdx : idx);

minSum = fmin(minSum, sumOfSquares);

}

lut0RowIdx = min(max(lut0RowIdx, (ushort)0), (ushort)(LUT_0_1_TABLE_ROW_SIZE - 1)); // clamp

tempOut.m_fin4 = lut0[lut0RowIdx];

output[gid] = tempOut;

}

Anyway I obfuscated my code a bit, so there may be some things which don't make a lot of sense, but I need to do all those operations. One option I was looking at was have one kernel to do the conversion, then another for the cost function -> though the cpu would have to loop through and run the cost function kernel for each of the intended objects.

Thanks for taking a look!

Rom1 · ‎10-04-2012

How large is your NDRange when you launch your kernel ?

One tip by i don't think it will be so much faster is to replace :

// Square all elements of the vector

f8temp1 *= f8temp1;
// Sum all elements of the vector together
sumOfSquares = f8temp1.s0;
sumOfSquares += f8temp1.s1;
sumOfSquares += f8temp1.s2;
sumOfSquares += f8temp1.s3;
sumOfSquares += f8temp1.s4;
sumOfSquares += f8temp1.s5;
sumOfSquares += f8temp1.s6;

By (sorry for Code mis formatting but the tool don't let me to copy your code easily )

f8temp1.s7=0; //(you don't use it ?)

sumOfSquares = dot(f8temp1,f8temp1);//This is more vectorial style

Another thing that can have very weird effect is that your struct "object" is not very well aligned since it have 13 * 4 octets it would be better to pad it up to 16 int (ie add int in to the struct)

jzuber · ‎10-05-2012

Thanks, I didn't think of that. I implemented that but saw no noticeable change in performance. I put it in because it looks way nicer though

I am using the AMD APP Kernel Analyzer and it says that I am limited by my use of VGPRs. I thought that Radeon GPGPUs were optimized to use vectors as opposed to scalars? Also is my understanding correct: I am using too many VGPRs? (or is it that I am not using them enough?? I am confused! lol)

Last night I reworked the code a bit, I will post later, but got approx a 30% improvement by making it a 2-D kernel where the loop is broken into smaller chunks. Now this does lead to a bit of waste as the conversion portion really only needs to be done for the threads belonging to the 1st dimension.

Will investigate tonight removing the VGPRs altogether!

-Jon

notzed · ‎10-08-2012

you can't remove the vgprs ... unless yuou removed the code. They are just the registers used by the vector processors to execute your algorithm.

Since LUT_0_1_TABLE_ROW_SIZE is so big, and all you're doing is sums/min, i'd be inclined to try flattening that out and running that in parallel - i.e. each workgroup does one output rather than each work item.

jzuber · ‎10-10-2012

So the number of VGPRs used has nothing to do with the local variables I have? I thought I had read somewhere that the local vars were registers that get used. I am probably misguided .

notzed · ‎10-10-2012

It's basic computer architecture of a load/store cpu.

registers are used for any arithmetic, the compiler will try to fit them to locals but there needn't be a 1:1 fit depending on register spillage (less than one register per variable), loop unrolling (more than one register per variable), and usage scope (they can be re-used).

*reducing* registers usage by having simpler code can help on a gpu, but you can't possibly "remove them altogether" since they are the only way to do arithmetic.

Rom1 · ‎10-10-2012

I fully agree with notzed. You should parallelized the LUT_0_1_TABLE_ROW_SIZE loop and reduce the min after. A cheap way (in developing time) to reduce is to allocate a table of float in local mem of size LUT_0_1_TABLE_ROW_SIZE to store sumOfSquaresand use only one thread to compute and store the min from the table(yes it's not true parallel reduction but you should try).

I thing that one of the two f8temp (f8temp1 and f8temp0) is not needed suppress can save some registers...

jzuber · ‎10-23-2012

I initially did the suggestion of using multidimensional kernel, local mem, and only the local_id(1) == 0 would calc the actual min sum. There was an increase in performance, 1.5 times faster. Next I have broken the kernel into three kernels. First is a A X 1 kernel which does the conversion. Second is a B X C kernel which goes through the loop and store "local" minimums in global arrays. Third is an A X 1 kernel which loops through the "local" minimums and finds the absolute minimum.

I was able to increase my occupancy of the "second" kernel from around 25% to 75% by tidying up many VGPRs (float8's and the output) and reusing many of them.

The first and third kernel run very quickly, however the second kernel is still slow(er than I need!). Here it is:

#define LUT_0_1_TABLE_ROW_SIZE 512

#define LUT_2_TABLE_COL_SIZE 12

#define LUT_2_TABLE_ROW_SIZE 207

#define LUT_2_TABLE_SIZE LUT_2_TABLE_COL_SIZE*LUT_2_TABLE_ROW_SIZE

struct object

{

float m_fval0;

float m_fval1;

float m_fval2;

float m_fval3;

float m_fval4;

float m_fval5;

float m_fval6;

float m_fval7;

unsigned int m_uval0;

unsigned int m_uval1;

unsigned int m_uval2;

unsigned int m_uval3;

unsigned int m_uval4;

};

__kernel void k2(global struct object * output, constant float * lut0, constant float4 * lut5, global float4 * scratch0,

global float4 * scratch1, global float * minSums, global uint * idxs)

{

float4 f4temp0 = { 0.0f, 0.0f, 0.0f, 0.0f };

float4 f4temp1 = { 0.0f, 0.0f, 0.0f, 0.0f };

float sumOfSquares;

uint tempIdx = 0;

float minSum = MAXFLOAT;

int iloc1 = get_local_id(1);

int nloc1 = get_local_size(1);

uint gid0 = get_global_id(0);

uint idx;

float tempVal = output[gid0].m_fval4;

// Calculate the cost function

for (idx = iloc1; idx < LUT_0_1_TABLE_ROW_SIZE; idx += nloc1)

{

// Calculate the lut5 components

f4temp1 = lut5[idx];

f4temp0.s0 = f4temp1.s0 * tempVal;

f4temp0.s1 = f4temp1.s1 * tempVal;

f4temp0.s2 = f4temp1.s2 * tempVal;

f4temp1.s0 = f4temp0.s1 - f4temp0.s0;

f4temp1.s1 = f4temp0.s2 - f4temp0.s1;

f4temp1.s2 = f4temp0.s2 - f4temp1.s0;

f4temp1.s3 = f4temp1.s0 - f4temp1.s1;

f4temp0 = scratch0[idx] - f4temp0;

f4temp1 = scratch1[idx] - f4temp1;

f4temp0.s3 = 0;

// Square then sum all elements of the vector together

sumOfSquares = dot(f4temp0, f4temp0);

sumOfSquares += dot(f4temp1, f4temp1);

// If the new sum is < the old min, then this is the new min! Update the idx and the minSum.

tempIdx = (minSum < sumOfSquares ? tempIdx : idx);

minSum = fmin(minSum, sumOfSquares);

}

idxs[iloc1 + gid0*nloc1] = tempIdx;

minSums[iloc1 + gid0*nloc1] = minSum;

}

Now when I run the CodeXL profiling the major limitation now seems to be ALU Packing (its down around 25%). I have read that it means that my code is not written to fully utilize AMDs 5-way VLIW execution units, though I am unsure how to resolve this. Seeing as the peak performance of the card I'm using is ~500 GFLOPs, I feel like I am doing something way wrong (and no I am not aiming to get 500 GFlops, but I need this to be fast!).

Something I am thinking of trying, but do not know if it will help, is to set each loop to perform operations for 4 iterations at a time. Thus doing the exact same thing to 4 idxs at a time. I am not sure if this will help, any comments would be appreciated.

Thanks to all who have commented so far!

-Jon

notzed · ‎10-24-2012

Your inner loop is too big, for a function that could be paralellised fairly easily (sum, and min).

This can all be parallelised - i'll give a cut-down example of how it works.

Assuming you had say 500 arrays of 512 elements and you wanted to find the minimum square of of each array and identify the element it was. You could do something like above where each of 500 work items is calculating the minimum of all 512 items, your global worksize would be '500' and your local would be anything.

e.g. kernel is:

int gid = get_global_id(0)

for (i=0;i<512;i++) {

v = squared value[gid * 512 + i]

mini = v < min ? i: mini;

min = v < min ? v : min;

}

store result

But this is slow for 2 reasons: a long loop, and the very sparse access pattern. Each next-work-item is accessing memory sparsely related to the previous work-item in the same group so you could end up with 1/32 (or 1/16?) of the potential bandwidth just from that. i.e. work items 0, 1, 2 will try to load memory at offset 0, 512, 1024 *at the same time*, which of course they can't, so you end up with 3 separate fetches. (you don't specify B and C, but i'm presuming B is something > 16 and C is > 1).

Instead you can "turn it sideways" and execute the inner loop in parallel. You could set the local work size to the width of the problem - but that isn't very general, and for various other reasons you would tend to use 64 (or some small multiple) instead. And then you do it like the following. In this case your global work size is set to 64 * 500, and your local work size is set to 64.

// this is important, unless your x dimension is 1, always use x dimension as the 'local index'

local float minval[64]

local int minint[64]

int lid = get_local_id(0);

int gid = get_global_id(0);

while (lid < 512) {

v = squared value[gid * 512 + lid]

mini = ...

min = ...

lid += 64

}

store min/mini at index lid in a local arrays

barrier local

parallel reduce the local array minval

barrier local

if (lid == 0) {

store result(s)

}

So now you get 1 result per work-group, rather than 1 per work-item, and you're reducing the amount of alu ops by almost 64x (i.e. 8 inner loops rather than 512, plus the reduction stuff at the end) at the expense of needing to launch 64x more threads, and needing the parallel reduce step (search for a description of that: it's a very basic operation on a gpu like a for loop is in c). The memory access is 'optimal' as well, e.g. work items 0, 1, 2 will try to access memory at offsets 0, 1, 2 at the same time - which can be serviced in a single memory transaction.

It might seem like a lot of threads - but GPU's absolutely love threads. They just eat them up and spit them out and then go back for secondses - it's the (only) way they hide memory latency and gain their efficiency. The local memory and barriers you need to use them this way are a pain, but a necessary evil to getting good performance.

For maximum flops you also want to try to get it so that sequentially increasing work-items in the same work-group access sequentially increasing and adjacent memory locations at the same time too (this is not a strict requirement, it's just the easiest to describe and usually code).

Archives Discussions

How can I optimise my algorithm?