Archives Discussions

malcolm3141 · ‎02-24-2010

When I compile code that uses the convert_int4(uchar4) function, the generated GPU assembly contains a lot of redundant instructions. Compiling this:

uchar4 b = ...;

int4 c0 = convert_int4(b) << 4;

c0 += (int4)(1, 2, 3, 4);

Becomes this:

01 TEX: ADDR(160) CNT(1)

10 VFETCH R0.x___, R0.y, fc156 MEGA(4)

FETCH_TYPE(NO_INDEX_OFFSET)

02 ALU: ADDR(56) CNT(89)

11 x: ASHR ____, R0.x, (0x00000010, 2.242077543e-44f).x

y: ASHR ____, R0.x, (0x00000008, 1.121038771e-44f).y

z: MOV ____, R0.x

w: ASHR ____, R0.x, (0x00000018, 3.363116314e-44f).z

12 x: LSHL ____, PV11.x, (0x00000018, 3.363116314e-44f).x

y: LSHL ____, PV11.w, (0x00000018, 3.363116314e-44f).x

z: LSHL ____, PV11.y, (0x00000018, 3.363116314e-44f).x

w: LSHL ____, PV11.z, (0x00000018, 3.363116314e-44f).x

13 x: ASHR ____, PV12.x, (0x00000018, 3.363116314e-44f).x

y: ASHR ____, PV12.y, (0x00000018, 3.363116314e-44f).x

z: ASHR ____, PV12.z, (0x00000018, 3.363116314e-44f).x

w: ASHR ____, PV12.w, (0x00000018, 3.363116314e-44f).x

14 x: AND_INT ____, PV13.x, (0x000000FF, 3.573311084e-43f).x

y: AND_INT ____, PV13.y, (0x000000FF, 3.573311084e-43f).x

z: AND_INT ____, PV13.z, (0x000000FF, 3.573311084e-43f).x

w: AND_INT ____, PV13.w, (0x000000FF, 3.573311084e-43f).x

15 x: LSHL ____, PV14.x, (0x00000018, 3.363116314e-44f).x

y: LSHL ____, PV14.y, (0x00000018, 3.363116314e-44f).x

z: LSHL ____, PV14.z, (0x00000018, 3.363116314e-44f).x

w: LSHL ____, PV14.w, (0x00000018, 3.363116314e-44f).x

16 x: ASHR ____, PV15.x, (0x00000018, 3.363116314e-44f).x

y: ASHR ____, PV15.y, (0x00000018, 3.363116314e-44f).x

z: ASHR ____, PV15.z, (0x00000018, 3.363116314e-44f).x

w: ASHR ____, PV15.w, (0x00000018, 3.363116314e-44f).x

17 x: LSHL ____, PV16.x, (0x00000018, 3.363116314e-44f).x

y: LSHL ____, PV16.z, (0x00000018, 3.363116314e-44f).x

z: LSHL ____, PV16.y, (0x00000018, 3.363116314e-44f).x

w: LSHL ____, PV16.w, (0x00000018, 3.363116314e-44f).x

18 x: LSHR ____, PV17.x, (0x00000018, 3.363116314e-44f).x

y: LSHR ____, PV17.y, (0x00000018, 3.363116314e-44f).x

z: LSHR ____, PV17.w, (0x00000018, 3.363116314e-44f).x

w: LSHR ____, PV17.z, (0x00000018, 3.363116314e-44f).x

19 x: LSHL ____, PV18.x, (0x00000004, 5.605193857e-45f).x

y: LSHL ____, PV18.w, (0x00000004, 5.605193857e-45f).x

z: LSHL ____, PV18.y, (0x00000004, 5.605193857e-45f).x

w: LSHL ____, PV18.z, (0x00000004, 5.605193857e-45f).x

20 x: ADD_INT ____, PV19.x, (0x00000003, 4.203895393e-45f).x

y: ADD_INT ____, PV19.y, (0x00000004, 5.605193857e-45f).y

z: ADD_INT ____, PV19.z, (0x00000002, 2.802596929e-45f).z

w: ADD_INT ____, PV19.w, 1

You can see that the conversion is already complete after instruction groups 11, 12, and 13 (provided 13 were an LSHR rather than an ASHR). Groups 14 to 18 are redundant.
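To make the LSHR point concrete, here is a minimal host-side C sketch of the three-step sequence for one byte of the packed word. The function names are hypothetical, and the lane numbering (lane 0 is the least significant byte) is an assumption, not something the listing confirms.

```c
#include <stdint.h>

/* Steps 11-13 with a logical final shift (LSHR): the byte is zero-extended. */
static int32_t byte_lane_lshr(uint32_t word, unsigned lane)
{
    uint32_t t = word >> (8u * lane); /* 11: bring the byte down to bits 0..7 */
    t <<= 24;                         /* 12: move it to the top, dropping the other bytes */
    t >>= 24;                         /* 13 as LSHR: zeros are shifted in from the top */
    return (int32_t)t;
}

/* Same steps with an arithmetic final shift (ASHR), as in the listing:
   the byte's top bit is replicated, so bytes >= 0x80 come out negative.
   Sign extension is emulated portably instead of shifting a negative int. */
static int32_t byte_lane_ashr(uint32_t word, unsigned lane)
{
    int32_t b = byte_lane_lshr(word, lane);
    return (b & 0x80) ? b - 256 : b;
}
```

For a byte such as 0xFF the logical version yields 255, which is what a uchar to int conversion should give, while the arithmetic version yields -1. That difference would explain why extra masking appears after group 13.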

Also, on a separate topic: why are 24-bit integer instructions not exposed in IL? Am I correct in believing that 32-bit multiplies are restricted to the t pipe, while 24-bit multiplies can go in the x, y, z, and w pipes? If so, then using mad24(...) and mul24(...) could provide some significant efficiencies!
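For reference, a rough host-side C sketch of what the OpenCL built-ins mul24 and mad24 compute for unsigned operands. The `_host` names are hypothetical stand-ins, and the sketch says nothing about which pipes the hardware uses. OpenCL only defines the result when both operands fit in 24 bits; masking off the upper bits here is one possible behaviour for out-of-range inputs, chosen for illustration.

```c
#include <stdint.h>

/* Emulates unsigned mul24: only the low 24 bits of each operand take part,
   and the product is truncated to 32 bits. */
static uint32_t mul24_host(uint32_t x, uint32_t y)
{
    return (x & 0x00FFFFFFu) * (y & 0x00FFFFFFu);
}

/* Emulates unsigned mad24: a 24-bit multiply followed by a 32-bit add. */
static uint32_t mad24_host(uint32_t x, uint32_t y, uint32_t z)
{
    return mul24_host(x, y) + z;
}
```

For operands that fit in 24 bits, the result matches an ordinary 32-bit multiply, which is why index arithmetic such as row * width + column is a common candidate for mad24.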

Malcolm

malcolm3141 · ‎02-25-2010

Found the answer to my own problem... The redundancy is due to the uchar4 type being loaded from global memory. The OpenCL compiler converts it to an int4 after loading. Then, when I call convert_int4, it performs a second, redundant conversion.

This highlights a potential optimisation in the OpenCL compiler: removing redundant conversions when dealing with types smaller than int.

My solution at present is to do all loads and stores as uint, uint2, or uint4 types and write my own packing and unpacking routines...
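A minimal sketch of what such packing and unpacking routines might look like, written as host-side C so it can be checked in isolation. The `int4_host` type and both function names are hypothetical, and the byte order (.x in the least significant byte) is an assumption. In a kernel, the same shifts and masks would be applied to a value loaded as uint.

```c
#include <stdint.h>

/* Stand-in for OpenCL's int4 (hypothetical name). */
typedef struct { int32_t x, y, z, w; } int4_host;

/* Unpack four unsigned bytes from one 32-bit load.
   Assumes .x lives in the least significant byte. */
static int4_host unpack_uchar4(uint32_t word)
{
    int4_host v;
    v.x = (int32_t)( word        & 0xFFu);
    v.y = (int32_t)((word >>  8) & 0xFFu);
    v.z = (int32_t)((word >> 16) & 0xFFu);
    v.w = (int32_t)( word >> 24);
    return v;
}

/* Pack the low 8 bits of each component back into one word for a 32-bit store. */
static uint32_t pack_uchar4(int4_host v)
{
    return  ((uint32_t)v.x & 0xFFu)
         | (((uint32_t)v.y & 0xFFu) <<  8)
         | (((uint32_t)v.z & 0xFFu) << 16)
         | (((uint32_t)v.w & 0xFFu) << 24);
}
```

Each component costs at most one shift and one mask in either direction, compared with the chain of shifts in the listing above. Values outside 0..255 are truncated on packing rather than saturated.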

Malcolm

MicahVillmow · ‎02-25-2010

malcolm,
These are both known issues and are things we are working to improve. The hardware natively supports only 32-bit and 64-bit scalar types and some of their vector versions. Anything else is going to be inefficient, because it isn't natively supported and must be converted to a native type before being used. This causes some redundant operations in cases where the compiler is trying to guarantee that the state is always valid, even though the resulting code is not the most efficient.

OpenCL compilation redundancy