

malcolm3141
Journeyman III

OpenCL compilation redundancy

When I compile code that calls convert_int4() on a uchar4, the generated GPU assembly contains a lot of redundant instructions. Compiling this:

 

 

uchar4 b = ...;

int4 c0 = convert_int4(b) << 4;

c0 += (int4)(1, 2, 3, 4);


Becomes this:


01 TEX: ADDR(160) CNT(1) 

     10  VFETCH R0.x___, R0.y, fc156  MEGA(4) 

         FETCH_TYPE(NO_INDEX_OFFSET) 

02 ALU: ADDR(56) CNT(89) 

     11  x: ASHR        ____,  R0.x,  (0x00000010, 2.242077543e-44f).x      

         y: ASHR        ____,  R0.x,  (0x00000008, 1.121038771e-44f).y      

         z: MOV         ____,  R0.x      

         w: ASHR        ____,  R0.x,  (0x00000018, 3.363116314e-44f).z      

     12  x: LSHL        ____,  PV11.x,  (0x00000018, 3.363116314e-44f).x      

         y: LSHL        ____,  PV11.w,  (0x00000018, 3.363116314e-44f).x      

         z: LSHL        ____,  PV11.y,  (0x00000018, 3.363116314e-44f).x      

         w: LSHL        ____,  PV11.z,  (0x00000018, 3.363116314e-44f).x      

     13  x: ASHR        ____,  PV12.x,  (0x00000018, 3.363116314e-44f).x      

         y: ASHR        ____,  PV12.y,  (0x00000018, 3.363116314e-44f).x      

         z: ASHR        ____,  PV12.z,  (0x00000018, 3.363116314e-44f).x      

         w: ASHR        ____,  PV12.w,  (0x00000018, 3.363116314e-44f).x      

     14  x: AND_INT     ____,  PV13.x,  (0x000000FF, 3.573311084e-43f).x      

         y: AND_INT     ____,  PV13.y,  (0x000000FF, 3.573311084e-43f).x      

         z: AND_INT     ____,  PV13.z,  (0x000000FF, 3.573311084e-43f).x      

         w: AND_INT     ____,  PV13.w,  (0x000000FF, 3.573311084e-43f).x      

     15  x: LSHL        ____,  PV14.x,  (0x00000018, 3.363116314e-44f).x      

         y: LSHL        ____,  PV14.y,  (0x00000018, 3.363116314e-44f).x      

         z: LSHL        ____,  PV14.z,  (0x00000018, 3.363116314e-44f).x      

         w: LSHL        ____,  PV14.w,  (0x00000018, 3.363116314e-44f).x      

     16  x: ASHR        ____,  PV15.x,  (0x00000018, 3.363116314e-44f).x      

         y: ASHR        ____,  PV15.y,  (0x00000018, 3.363116314e-44f).x      

         z: ASHR        ____,  PV15.z,  (0x00000018, 3.363116314e-44f).x      

         w: ASHR        ____,  PV15.w,  (0x00000018, 3.363116314e-44f).x      

     17  x: LSHL        ____,  PV16.x,  (0x00000018, 3.363116314e-44f).x      

         y: LSHL        ____,  PV16.z,  (0x00000018, 3.363116314e-44f).x      

         z: LSHL        ____,  PV16.y,  (0x00000018, 3.363116314e-44f).x      

         w: LSHL        ____,  PV16.w,  (0x00000018, 3.363116314e-44f).x      

     18  x: LSHR        ____,  PV17.x,  (0x00000018, 3.363116314e-44f).x      

         y: LSHR        ____,  PV17.y,  (0x00000018, 3.363116314e-44f).x      

         z: LSHR        ____,  PV17.w,  (0x00000018, 3.363116314e-44f).x      

         w: LSHR        ____,  PV17.z,  (0x00000018, 3.363116314e-44f).x      

     19  x: LSHL        ____,  PV18.x,  (0x00000004, 5.605193857e-45f).x      

         y: LSHL        ____,  PV18.w,  (0x00000004, 5.605193857e-45f).x      

         z: LSHL        ____,  PV18.y,  (0x00000004, 5.605193857e-45f).x      

         w: LSHL        ____,  PV18.z,  (0x00000004, 5.605193857e-45f).x      

 

 

     20  x: ADD_INT     ____,  PV19.x,  (0x00000003, 4.203895393e-45f).x      

         y: ADD_INT     ____,  PV19.y,  (0x00000004, 5.605193857e-45f).y      

         z: ADD_INT     ____,  PV19.z,  (0x00000002, 2.802596929e-45f).z      

         w: ADD_INT     ____,  PV19.w,  1      

 

You can see that the conversion would already be complete after instruction groups 11, 12, and 13 if 13 were an LSHR instead of an ASHR. The instructions in groups 14 to 18 are redundant.
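To make the claim concrete, here is a minimal plain C sketch of the same shift sequence, run on the host rather than the GPU. The function names are hypothetical, and it assumes that right-shifting a negative signed value is arithmetic (true on common compilers, but implementation-defined in C). It illustrates the arithmetic only; it is not what the compiler actually emits.

```c
#include <stdint.h>

/* Hypothetical helper: groups 11-13 of the listing, with 13 as LSHR.
   Extracts byte `lane` (0..3) of a packed 32-bit word, zero-extended. */
static uint32_t extract_byte_lane(uint32_t packed, unsigned lane)
{
    uint32_t shifted = packed >> (8u * lane); /* group 11: move the byte to bits 0..7 */
    uint32_t top     = shifted << 24;         /* group 12: LSHL 24 discards the upper bits */
    return top >> 24;                         /* group 13 as LSHR: zero-extend */
}

/* Hypothetical helper: replays groups 14-18 on an already converted value. */
static uint32_t replay_groups_14_to_18(uint32_t v)
{
    v &= 0xFFu;                        /* 14: AND_INT 0xFF */
    v <<= 24;                          /* 15: LSHL 24 */
    v = (uint32_t)((int32_t)v >> 24);  /* 16: ASHR 24 (assumed arithmetic shift) */
    v <<= 24;                          /* 17: LSHL 24 */
    v >>= 24;                          /* 18: LSHR 24 */
    return v;
}
```

For any value in the range 0 to 255, replay_groups_14_to_18 returns its input unchanged, which is why those groups add nothing.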


Also, on a separate topic: why are 24-bit integer instructions not exposed in IL? Am I correct in believing that 32-bit multiplies are restricted to the t pipe, while 24-bit multiplies can go in the x, y, z, and w pipes? If so, then using mad24(...) and mul24(...) could provide some significant efficiency gains!


Malcolm

 



malcolm3141
Journeyman III

Found the answer to my own problem... The redundancy is due to the uchar4 type being loaded from global memory. The OpenCL compiler converts it to an int4 after loading. Then, when I call convert_int4, it performs a second, redundant conversion.

This highlights a potential optimisation in the OpenCL compiler: removing redundant conversions when dealing with types smaller than int.

My solution at present is to do all loads and stores as uint, uint2, or uint4 types and write my own packing and unpacking routines...
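As a rough illustration of that workaround, the packing and unpacking could look something like the plain C sketch below. The struct int4_s and both function names are hypothetical stand-ins (OpenCL C would use int4 and uint directly), and the lane order, with x in the least-significant byte, is an assumption based on little-endian layout rather than anything confirmed in this thread.

```c
#include <stdint.h>

/* Hypothetical stand-in for the OpenCL int4 vector type. */
typedef struct { int32_t x, y, z, w; } int4_s;

/* Unpack four unsigned bytes from one 32-bit word into four ints.
   Assumes x is the least-significant byte. */
static int4_s unpack_uchar4(uint32_t packed)
{
    int4_s v;
    v.x = (int32_t)( packed        & 0xFFu);
    v.y = (int32_t)((packed >> 8)  & 0xFFu);
    v.z = (int32_t)((packed >> 16) & 0xFFu);
    v.w = (int32_t)( packed >> 24);
    return v;
}

/* Pack the low 8 bits of each component back into one 32-bit word.
   Values outside 0..255 are truncated, not saturated. */
static uint32_t pack_uchar4(int4_s v)
{
    return  ((uint32_t)v.x & 0xFFu)
         | (((uint32_t)v.y & 0xFFu) << 8)
         | (((uint32_t)v.z & 0xFFu) << 16)
         | (((uint32_t)v.w & 0xFFu) << 24);
}
```

Each component costs one shift and at most one mask, so the widening happens exactly once instead of being repeated after the load.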

 

Malcolm


malcolm,
These are both known issues and are things we are working to improve. The hardware natively supports only 32-bit and 64-bit scalar types and some of their vector versions. Anything else is going to be inefficient, because it isn't natively supported and has to be converted to a native type before it can be used. This causes some redundant operations in cases where the compiler is trying to guarantee that the state is always valid, even though the result is not the most efficient code.