5 Replies Latest reply on Nov 20, 2011 8:47 AM by Meteorhead

    It would be really nice if you could . . .

      . . . use an int4 as a subscript to a float4, and have it work element-wise....

      It does seem that it's not possible, but:

      I have a float4 calculation, which I must break down into subelements and process individually, s0->s3, because I'm using an array in the calc and the subscripts for the array differ between the float4 elements.

      I kinda knew it wasn't possible, but I tried anyway to perform the whole calc as a single float4, using an int4 as the array subscript (hoping that the float4.s0 calc would use int4.s0, float4.s1 would use int4.s1, et cetera). When I tried it, of course, I got an error message complaining that the array subscript was not an int.

      Does anyone else think this would be a good thing to be able to do? Any general ideas as to what to try, from someone who's been here before? It's not critical; my app works; I'd just like to squeeze a leeeetle bit more performance out of it....

        • It would be really nice if you could . . .

          I'm not sure what you're referring to. Could you post the line of code for the calculation?

          I don't think you can load independent element data values as a vector operation. I assume vectors must be packed, as this is how most hardware works.

          The only way to do it in one line of code (if I'm correct) would be to do:

          (float4) (floatarray[intvar.s0],floatarray[intvar.s1],floatarray[intvar.s2],floatarray[intvar.s3])

          This should work; however, it won't offer the same efficiency as loading a single packed vector, even with the flexibility of VLIW.

            • It would be really nice if you could . . .

              Here's what I'm doing (code attached).  The tweens must be done individually, but then I want to be as efficient with the subsequent calc as I can.

              Three chunks of code: the first, each of the f4 elements treated as a scalar; the second, assigning intermediate variables to enable a single f4 calc; the third, the fantasy I attempted knowing full well it wouldn't work.

              BUT, I will try your suggestion, replacing each of the k_Hval[] references with a packed vector just as you describe.  It may be more  efficient than what's in code chunk 2.

              Thank you!


              Edit: I missed something in the mashup. toutemp in the first block was a temp var which disappeared, but you get my gist. Also, I apologize for the too-long lines that wrap; they looked OK in the attach-code box, and I don't know how to edit attached code. Oops, and k_Tval is an array similar to k_Hval.

              __constant float k_Hval[] = { .000, .0583, .0833, .108, .125, .150, .167, .183, .222, .255, .367, .422, .464,
                                            .486, .522, .544, .589, .619, .706, .744, .797, .867, .919, .944, 1.00 };

              #define tween(a,b) {i=0;while(a>b[i])i++;}

              void k_Hue_Convert( float4 inval, float4 *Toutval )
              {
                  int i=0;
                  float4 Hvi;
                  float4 Hvim1;
                  float4 Tvi;
                  float4 Tvim1;

                  /*
                  tween( inval.s0, k_Hval ); toutemp.s0 = ( inval.s0 - k_Hval[i-1] ) / ( k_Hval[i] - k_Hval[i-1] ) * ( k_Tval[i] - k_Tval[i-1] ) + k_Tval[i-1];   // individually
                  tween( inval.s1, k_Hval ); toutemp.s1 = ( inval.s1 - k_Hval[i-1] ) / ( k_Hval[i] - k_Hval[i-1] ) * ( k_Tval[i] - k_Tval[i-1] ) + k_Tval[i-1];   // calculated,
                  tween( inval.s2, k_Hval ); toutemp.s2 = ( inval.s2 - k_Hval[i-1] ) / ( k_Hval[i] - k_Hval[i-1] ) * ( k_Tval[i] - k_Tval[i-1] ) + k_Tval[i-1];   // not very
                  tween( inval.s3, k_Hval ); toutemp.s3 = ( inval.s3 - k_Hval[i-1] ) / ( k_Hval[i] - k_Hval[i-1] ) * ( k_Tval[i] - k_Tval[i-1] ) + k_Tval[i-1];   // efficient
                  */

                  tween( inval.s0, k_Hval ); Hvi.s0 = k_Hval[i]; Hvim1.s0 = k_Hval[i-1]; Tvi.s0 = k_Tval[i]; Tvim1.s0 = k_Tval[i-1];   // assign intermediate
                  tween( inval.s1, k_Hval ); Hvi.s1 = k_Hval[i]; Hvim1.s1 = k_Hval[i-1]; Tvi.s1 = k_Tval[i]; Tvim1.s1 = k_Tval[i-1];   // values first
                  tween( inval.s2, k_Hval ); Hvi.s2 = k_Hval[i]; Hvim1.s2 = k_Hval[i-1]; Tvi.s2 = k_Tval[i]; Tvim1.s2 = k_Tval[i-1];   // then a single f4 calc,
                  tween( inval.s3, k_Hval ); Hvi.s3 = k_Hval[i]; Hvim1.s3 = k_Hval[i-1]; Tvi.s3 = k_Tval[i]; Tvim1.s3 = k_Tval[i-1];   // more efficient
                  *Toutval = ( inval - Hvim1 ) / ( Hvi - Hvim1 ) * ( Tvi - Tvim1 ) + Tvim1;

                  /*
                  int4 i4;                                 // (can't work, right?)
                  tween( inval.s0, k_Hval ); i4.s0 = i;    // assign
                  tween( inval.s1, k_Hval ); i4.s1 = i;    // just
                  tween( inval.s2, k_Hval ); i4.s2 = i;    // the
                  tween( inval.s3, k_Hval ); i4.s3 = i;    // "i"s ... then one f4 calc:
                  *Toutval = ( inval - k_Hval[i4-1] ) / ( k_Hval[i4] - k_Hval[i4-1] ) * ( k_Tval[i4] - k_Tval[i4-1] ) + k_Tval[i4-1];
                  */
              }
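One thing worth noting about the tween macro as posted: for an input of exactly 0.0 the loop never advances, so i stays 0 and k_Hval[i-1] reads before the start of the array, and for an input above 1.0 the loop walks past the end. Below is a minimal sketch of a clamped version, written as plain C rather than OpenCL C so it can be tried on the host. The name tween_index is hypothetical (it is not in the posted code), and the clamping behaviour for out-of-range inputs is an assumption about what the caller would want.

```c
#include <assert.h>
#include <stddef.h>

/* Same 25 breakpoints as k_Hval in the post (plain C array here). */
static const float k_Hval[] = {
    .000f, .0583f, .0833f, .108f, .125f, .150f, .167f, .183f, .222f,
    .255f, .367f,  .422f,  .464f, .486f, .522f, .544f, .589f, .619f,
    .706f, .744f,  .797f,  .867f, .919f, .944f, 1.00f
};
#define K_HVAL_COUNT ((int)(sizeof k_Hval / sizeof k_Hval[0]))

/* Hypothetical helper: find the upper index of the segment containing a.
 * Starting at 1 and stopping at the last entry guarantees that both
 * k_Hval[i-1] and k_Hval[i] are valid reads, even for a <= 0 or a > 1. */
static int tween_index(float a)
{
    int i = 1;
    while (i < K_HVAL_COUNT - 1 && a > k_Hval[i])
        i++;
    assert(i >= 1 && i < K_HVAL_COUNT);  /* postcondition: i-1 and i in range */
    return i;
}
```

With this, an input of 0.0 maps to the first segment (index 1) and anything at or above 1.0 maps to the last segment (index 24), so the interpolation formula in the post never reads outside the table.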

                • It would be really nice if you could . . .

                  Why not work with ints and convert the result to float?

                    • It would be really nice if you could . . .

                      No, the problem with the code is that you cannot address an array with a vector subscript. I haven't checked the specs, but this makes sense, as four subscripts (an int4) applied to a 4-element vector could theoretically point to 16 elements.

                      I think what I suggested may be the only way that works. I don't know enough about the VLIW architecture, but the compiler may be able to schedule each compute unit to look up scalar values independently, in which case the only cost would be the lack of memory coherency. If it can't, then there's not much else you can do. With GPGPU shifting to SIMT/MIMD in the future, this will become less of a problem.

                      The only other way would be to have a four-dimensional array of data that accounts for every possible combination of the four k_Hval[] subscripts (one per element, s0-s3), then combine the four integer subscripts like: i.s0 | (i.s1<<5)...etc. to create a single integer subscript. This would only work for small arrays, as the memory requirement would be arraysize^4, which may not fit in constant memory.
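To put numbers on the combined-table idea, here is a rough sketch in plain C, assuming the values being packed are the four integer subscripts found by tween (each in 0..24, so 5 bits apiece) and that each table slot would hold one 4-float result. The names pack_indices and combined_table_bytes are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { INDEX_BITS = 5 };  /* 25 table entries fit in 5 bits (0..31) */

/* Pack four per-element subscripts into one integer subscript,
 * 5 bits each, in the spirit of i.s0 | (i.s1 << 5) | ... */
static uint32_t pack_indices(int i0, int i1, int i2, int i3)
{
    assert(i0 >= 0 && i0 < 32 && i1 >= 0 && i1 < 32);
    assert(i2 >= 0 && i2 < 32 && i3 >= 0 && i3 < 32);
    return (uint32_t)i0
         | ((uint32_t)i1 << INDEX_BITS)
         | ((uint32_t)i2 << (2 * INDEX_BITS))
         | ((uint32_t)i3 << (3 * INDEX_BITS));
}

/* Size of a table with one slot per packed subscript, where each slot
 * stores four 4-byte floats (assumption: 16 bytes per slot). */
static uint64_t combined_table_bytes(void)
{
    uint64_t slots = (uint64_t)1 << (4 * INDEX_BITS);  /* 32^4 = 1,048,576 */
    return slots * 16u;
}
```

Under those assumptions the table comes to 16 MiB per lookup array, and even a tightly packed 25^4 layout would be about 6 MB. That is far more than the 64 KB that the OpenCL 1.x full profile guarantees as a minimum for constant buffers, which supports the caveat above that this only suits very small arrays.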

                        • It would be really nice if you could . . .

                          This is not possible because it goes against the OpenCL standard, so it will never be supported. What you might want to do is write a simple extract function that does this for you.

                          float4 extract(__constant float* source, int4 index)
                          {
                              float4 result;
                              result.x = source[index.x];
                              result.y = source[index.y];
                              result.z = source[index.z];
                              result.w = source[index.w];
                              return result;
                          }
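For what it's worth, here is how an extract helper like this could slot into the interpolation from earlier in the thread, sketched in plain C so it can be run on the host. The float4 and int4 structs are stand-ins for the OpenCL built-in types, interpolate4 and lerp1 are hypothetical names, and demo_h / demo_t are made-up four-entry tables (the thread never shows the contents of k_Tval). In OpenCL C the per-lane arithmetic would collapse to the single float4 expression from the original post.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } float4;  /* stand-in for OpenCL float4 */
typedef struct { int   x, y, z, w; } int4;    /* stand-in for OpenCL int4   */

/* Made-up tables for illustration only; not the thread's k_Hval/k_Tval. */
static const float demo_h[] = { 0.0f, 0.25f, 0.5f, 1.0f };
static const float demo_t[] = { 0.0f, 10.0f, 20.0f, 40.0f };

/* Gather four independently indexed scalars into one vector. */
static float4 extract(const float *source, int4 index)
{
    float4 result;
    result.x = source[index.x];
    result.y = source[index.y];
    result.z = source[index.z];
    result.w = source[index.w];
    return result;
}

/* Linear interpolation for one lane: map v from [h0, h1] onto [t0, t1]. */
static float lerp1(float v, float h0, float h1, float t0, float t1)
{
    return (v - h0) / (h1 - h0) * (t1 - t0) + t0;
}

/* Four gathers, then the same formula applied to every lane. */
static float4 interpolate4(float4 in, const float *h, const float *t, int4 i)
{
    assert(i.x > 0 && i.y > 0 && i.z > 0 && i.w > 0);  /* i-1 must be valid */
    int4 im1 = { i.x - 1, i.y - 1, i.z - 1, i.w - 1 };
    float4 hv = extract(h, i), hvm1 = extract(h, im1);
    float4 tv = extract(t, i), tvm1 = extract(t, im1);
    float4 out = {
        lerp1(in.x, hvm1.x, hv.x, tvm1.x, tv.x),
        lerp1(in.y, hvm1.y, hv.y, tvm1.y, tv.y),
        lerp1(in.z, hvm1.z, hv.z, tvm1.z, tv.z),
        lerp1(in.w, hvm1.w, hv.w, tvm1.w, tv.w)
    };
    return out;
}
```

This is close to the "fantasy" third chunk from the earlier post: the four subscripts are collected into an int4 first, and the only scalar work left is inside extract. Whether it actually beats the intermediate-variable version would need to be measured on the target device.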