cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

kbrafford
Adept II

Wow..what did I really do when I switched to native_sin and native_cos?

Stream KernelAnalyzer says that my kernel went from 158 million kernels per second to 222 million kernels per second when I switched one cos and one sine to native_cos and native_sin

The weird part is that the basic structure of the code is this:

1. setup, including a call to cos(float4)

2. loop doing 256 iterations

3. teardown, including a call to sin(float4)

All I did was change the two trig functions.  Why does the SKA tool think I've invented a new sliced bread?  The wall clock certainly doesn't agree with that.

--Keith Brafford

0 Likes
19 Replies

sin/cos are required to fulfill the LLVM requirements for OpenCL floating point accuracy. Native_* versions have no such restraints and compile down to a single hardware instruction. The trig functions are not single instruction functions in most cases.
0 Likes

Would I be better off calculating the cos and sin values on the CPU and passing them into the kernels?

I am having a hard time understanding why the SKA tool predicts such a massive speed-up that simply doesn't happen in reality.

0 Likes

If your kernel is not ALU bound, it won't speed up by decreasing the amount of ALU to compute the results.
0 Likes
eugenek
Journeyman III

native_cos with float argument is 1 hardware instruction. cos with float argument is around 200 instructions if the argument is less than 3140000, and closer to 300 instructions otherwise.

 

Would I be better off calculating the cos and sin values on the CPU and passing them into the kernels?


Not really. The CPU takes an awful lot of time to compute sin/cos as well. The GPU can manage about 4 billion cosines per second. The CPU can do 200 million cosines per second.

0 Likes

Stand back in awe of the attached code, which results in 142 ALU instructions, 10 fetches, 3 writes, 62 GPRs and 2 scratch registers on HD5870 

kernel void test(global float *A, global float *B) { int pos = get_global_id(0); B[pos] = powr(A[pos], A[pos + 1]); }

0 Likes

Jawed,

Where did you got those numbers? I tried your code in SKA and get 178 Instruction clauses and it uses 11 GPRs. With native version it is 11 clauses and 3 GPRs. The number of reads are 2 and write is 1 in both cases as expected.

0 Likes

SKA 1.7 with SDK 2.3 installed.

I suspect you have a different SDK installed.

If I use native_powr() then I get 10 ALUs and 3 GPRs (I suspect you meant 10 not 11).

Why isn't native_pow() defined?

0 Likes

Originally posted by: himanshu.gautam Jawed,

 

Where did you got those numbers? I tried your code in SKA and get 178 Instruction clauses and it uses 11 GPRs. With native version it is 11 clauses and 3 GPRs. The number of reads are 2 and write is 1 in both cases as expected.

 

 

Maybe you're compiling for a different GPU? I also see 62 GPRs and 10 fetches on a 5870, 63 GPRs and 8 fetches on a 6970.

 

Your CPU may be doing something similar when you ask it to compute a power, it's just not so blatantly obvious. On an Intel Core Duo, the two instructions that do the bulk of the job inside pow() take 165 clock ticks.

0 Likes

Can you post the IL\ISA generated without native function.

0 Likes

here

kernel void test(global float *A, global float *B) { int pos = get_global_id(0); B[pos] = powr(A[pos], A[pos + 1]); } ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(5) KCACHE0(CB0:0-15) 0 t: MULLO_INT ____, R1.x, KC0[1].x 1 y: ADD_INT ____, R0.x, PS0 2 w: ADD_INT ____, PV1.y, KC0[6].x 3 x: LSHL R0.x, PV2.w, (0x00000002, 2.802596929e-45f).x 01 MEM_SCRATCH_WRITE_ACK: VEC_PTR[0].x___, R0, ARRAY_SIZE(1) ELEM_SIZE(3) MARK 02 ALU: ADDR(37) CNT(5) KCACHE1(CB1:0-15) 4 x: MOV R0.x, R0.x 5 z: ADD_INT ____, KC1[0].x, PV4.x 6 y: LSHR R0.y, PV5.z, (0x00000002, 2.802596929e-45f).x w: ADD_INT R0.w, PV5.z, (0x00000004, 5.605193857e-45f).y 03 WAIT_ACK: Outstanding_acks <= 0 04 TEX: ADDR(512) CNT(1) 7 VFETCH R51.x___, R0.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 05 ALU: ADDR(42) CNT(66) 8 x: AND_INT R50.x, (0x7FFFFFFF, 1.#QNANf).x, R51.x z: LSHR R0.z, R0.w, (0x00000002, 2.802596929e-45f).y 9 y: ADD R3.y, -PV8.x, 1.0f z: OR_INT ____, PV8.x, (0x3F800000, 1.0f).x w: LSHR T0.w, PV8.x, (0x00000017, 3.222986468e-44f).y 10 x: MULADD_e ____, PV9.y, (0x3E124925, 0.1428571492f).y, (0x3E2AAAAB, 0.1666666716f).x y: ADD ____, PV9.z, -1.0f z: ADD_INT T0.z, (0xFFFFFF81, -1.#QNANf).z, PV9.w 11 x: CNDE_INT ____, T0.w, PV10.y, R50.x y: LSHR ____, PV10.y, (0x00000017, 3.222986468e-44f).x w: MULADD_e ____, R3.y, PV10.x, (0x3E4CCCCD, 0.200000003f).y 12 x: AND_INT ____, PV11.x, (0x007FFFFF, 1.175494211e-38f).x y: MULADD_e ____, R3.y, PV11.w, (0x3E800000, 0.25f).y z: LSHL ____, PV11.x, 1 w: AND_INT T1.w, PV11.x, (0x007F0000, 1.166310801e-38f).z t: ADD_INT ____, (0xFFFFFF03, -1.#QNANf).w, PV11.y 13 x: OR_INT T0.x, PV12.x, (0x3F000000, 0.5f).x y: AND_INT ____, PV12.z, (0x00010000, 9.183549616e-41f).y z: MULADD_e T0.z, R3.y, PV12.y, (0x3EAAAAAB, 0.3333333433f).z w: CNDE_INT ____, T0.w, PS12, T0.z 14 x: ADD_INT ____, T1.w, PV13.y t: I_TO_F R8.z, PV13.w 15 x: LSHR R3.x, PV14.x, (0x0000000E, 1.961817850e-44f).x w: OR_INT ____, PV14.x, (0x3F000000, 0.5f).y 16 x: ADD R7.x, PV15.w, -T0.x y: ADD_INT ____, PV15.x, (0x000007E0, 2.825017704e-42f).x z: ADD_INT ____, PV15.x, (0x00000320, 1.121038771e-42f).y w: ADD_INT ____, PV15.x, (0x00000530, 1.860924361e-42f).z t: MUL_e T0.y, R3.y, R3.y 17 x: MUL_e ____, R3.y, PS16 y: ASHR ____, PV16.z, (0x00000002, 2.802596929e-45f).x z: ASHR ____, PV16.w, (0x00000002, 2.802596929e-45f).x w: ASHR T1.w, PV16.y, (0x00000002, 2.802596929e-45f).x t: AND_INT T1.z, (0x7FFFFFFF, 1.#QNANf).y, R3.y 18 x: AND_INT ____, PV17.y, (0x00000003, 4.203895393e-45f).x y: ASHR R0.y, PV17.y, (0x00000002, 2.802596929e-45f).y z: AND_INT T0.z, PV17.z, (0x00000003, 4.203895393e-45f).x w: MUL_e R1.w, T0.z, PV17.x t: ASHR R0.x, PV17.z, (0x00000002, 2.802596929e-45f).y 19 x: ADD_INT R2.x, (0xFFFFFFFD, -1.#QNANf).x, PV18.x y: ADD_INT R1.y, (0xFFFFFFFE, -1.#QNANf).y, PV18.x z: ADD_INT R1.z, -1, PV18.x w: ADD_INT R0.w, -1, PV18.z t: ADD_INT R3.z, (0xFFFFFFFE, -1.#QNANf).y, PV18.z 20 x: ASHR R1.x, T1.w, (0x00000002, 2.802596929e-45f).x y: ADD_INT R2.y, (0xFFFFFFFD, -1.#QNANf).y, T0.z z: AND_INT R2.z, T1.w, (0x00000003, 4.203895393e-45f).z w: SETGT_DX10 R2.w, (0x3D800000, 0.0625f).w, T1.z VEC_021 t: MUL_e R7.z, T0.y, -0.5 06 VTX: ADDR(514) CNT(2) 21 VFETCH R4, R0.y, fc130 MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 22 VFETCH R5, R0.x, fc130 MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 07 TEX: ADDR(518) CNT(1) 23 VFETCH R49.x___, R0.z, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 08 VTX: ADDR(520) CNT(1) 24 VFETCH R6, R1.x, fc130 MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 09 ALU: ADDR(108) CNT(47) 25 x: CNDE_INT ____, R0.w, R5.y, R5.x y: ADD_INT ____, -1, R2.z z: ADD_INT T0.z, (0xFFFFFFFE, -1.#QNANf).x, R2.z w: CNDE_INT ____, R1.z, R4.y, R4.x VEC_021 t: ADD_INT T1.z, (0xFFFFFFFD, -1.#QNANf).y, R2.z 26 y: CNDE_INT ____, PV25.y, R6.y, R6.x z: CNDE_INT ____, R1.y, R4.z, PV25.w w: CNDE_INT ____, R3.z, R5.z, PV25.x VEC_021 27 x: CNDE_INT ____, T0.z, R6.z, PV26.y y: CNDE_INT ____, R2.x, R4.w, PV26.z z: CNDE_INT ____, R2.y, R5.w, PV26.w VEC_021 w: ADD T0.w, -R1.w, R7.z VEC_021 t: ASHR T0.x, R3.x, (0x00000002, 2.802596929e-45f).x 28 x: MUL_e T1.x, R7.x, PV27.y y: MUL_e T0.y, R7.x, PV27.z z: AND_INT ____, PS27, (0x00000003, 4.203895393e-45f).x w: CNDE_INT T1.w, T1.z, R6.w, PV27.x 29 x: ADD T0.x, PV28.x, PV28.y y: ADD_INT R6.y, -1, PV28.z z: ASHR R6.z, T0.x, (0x00000002, 2.802596929e-45f).x w: ADD_INT R6.w, (0xFFFFFFFE, -1.#QNANf).y, PV28.z t: ADD_INT R4.w, (0xFFFFFFFD, -1.#QNANf).z, PV28.z 30 x: MULADD_e ____, PV29.x, (0x3E800000, 0.25f).y, (0x3EAAAAAB, 0.3333333433f).x y: ADD ____, T0.y, -PV29.x z: MUL_e T1.z, PV29.x, PV29.x w: CNDE_INT T2.w, R2.w, -PV29.x, R7.z t: ADD R2.y, -R3.y, T0.w 31 x: ADD ____, T1.x, PV30.y y: AND_INT R5.y, (0xFFFFF000, -1.#QNANf).x, R49.x w: MULADD_e ____, T0.x, PV30.x, 0.5 VEC_201 32 x: ADD R7.x, R49.x, -PV31.y w: MULADD_e ____, PV31.w, T1.z, PV31.x 33 y: MULADD_e ____, R8.z, (0x3805FDF4, 0.0000319461833f).x, -PV32.w 34 z: ADD ____, PV33.y, T1.w 35 x: CNDE_INT T0.x, R2.w, PV34.z, -R1.w w: ADD R5.w, -T0.x, PV34.z 36 x: CNDE_INT R6.x, R2.w, PV35.w, T0.w 37 z: ADD ____, PV36.x, -T2.w 38 y: ADD R4.y, T0.x, -PV37.z 10 VTX: ADDR(522) CNT(1) 39 VFETCH R1, R6.z, fc130 MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 11 ALU: ADDR(155) CNT(94) 40 z: CNDE_INT ____, R6.y, R1.y, R1.x 41 y: CNDE_INT ____, R6.w, R1.z, PV40.z 42 x: CNDE_INT ____, R4.w, R1.w, PV41.y 43 w: MULADD_e ____, R8.z, (0x3F317000, 0.6931152344f).x, PV42.x 44 z: ADD ____, R5.w, PV43.w w: CNDE_INT T2.w, R2.w, PV43.w, -R3.y VEC_120 45 x: CNDE_INT T0.x, R2.w, PV44.z, R2.y y: AND_INT R3.y, (0x7FFFFFFF, 1.#QNANf).x, R49.x 46 x: LSHR T1.x, PV45.y, (0x00000017, 3.222986468e-44f).x z: ADD ____, -PV45.x, T2.w w: AND_INT T2.w, (0xFFFFF000, -1.#QNANf).y, PV45.x 47 x: SUB_INT ____, (0x00000096, 2.101947696e-43f).x, PV46.x y: ADD ____, R6.x, PV46.z z: ADD T1.z, T0.x, -PV46.w VEC_120 48 x: ADD ____, R4.y, PV47.y y: AND_INT ____, PV47.x, (0x0000001F, 4.344025239e-44f).x 49 y: ADD T0.y, PV48.x, T1.z w: LSHL ____, 1, PV48.y 50 x: AND_INT ____, R49.x, PV49.w y: ADD_INT T1.y, (0xFFFFFF82, -1.#QNANf).x, T1.x z: MUL_e ____, PV49.y, R7.x VEC_021 w: ADD_INT ____, -1, PV49.w 51 x: SETNE_INT T1.x, PV50.x, 0.0f y: MULADD_e ____, T2.w, R7.x, PV50.z z: AND_INT ____, R49.x, PV50.w w: SETGT_INT T0.w, PV50.y, 0.0f 52 y: SETE_INT ____, PV51.z, 0.0f w: MULADD_e T1.w, T0.y, R5.y, PV51.y 53 x: SETGT_INT T1.x, (0x00000019, 3.503246161e-44f).x, T1.y z: MULADD_e R6.z, R5.y, T2.w, PV52.w w: AND_INT ____, T1.x, PV52.y 54 x: MUL_e ____, PV53.z, (0x42B8AA3B, 92.33248138f).x y: MULADD_e ____, R5.y, T2.w, -PV53.z z: SETE_DX10 T1.z, PV53.z, (0x42B17218, 88.72283936f).y w: AND_INT ____, PV53.w, T0.w VEC_021 t: SETGT_DX10 ____, PV53.z, (0x42B17218, 88.72283936f).y 55 x: AND_INT R61.x, PV54.w, T1.x y: AND_INT T1.y, PS54, 1 z: ADD T2.z, T1.w, PV54.y w: SETE_INT R6.w, R50.x, (0x7F800000, 1.#INFf).x t: F_TO_I R45.x, PV54.x 56 y: SETGT_DX10 ____, PV55.z, (0xB482E308, -0.0000002437957392f).x z: AND_INT ____, PS55, (0x0000003F, 8.828180325e-44f).y w: AND_INT ____, PV55.x, PV55.w t: I_TO_F T0.z, PS55 57 x: LSHL ____, PV56.z, (0x00000002, 2.802596929e-45f).x y: MULADD_e ____, PS56, (0xBC310000, -0.01080322266f).y, R6.z z: AND_INT ____, PV56.y, T1.z w: OR_INT ____, R51.x, R49.x t: AND_INT R60.x, PV56.w, 1 58 x: ADD_INT ____, PV57.x, (0x000009F0, 3.564903293e-42f).x y: ADD_INT ____, PV57.x, (0x00000210, 7.398855892e-43f).y z: AND_INT ____, PV57.z, 1 w: MULADD_e ____, T0.z, (0xB7E42FEF, -0.00002720203884f).z, PV57.y t: AND_INT ____, (0x7FFFFFFF, 1.#QNANf).w, PV57.w 59 x: OR_INT R44.x, PV58.z, T1.y y: ADD T0.y, PV58.w, T2.z z: ASHR ____, PV58.x, (0x00000002, 2.802596929e-45f).x w: ASHR T1.w, PV58.y, (0x00000002, 2.802596929e-45f).x t: SETE_INT R53.x, PS58, 0.0f 60 x: AND_INT ____, PV59.z, (0x00000003, 4.203895393e-45f).x y: MUL_e T1.y, PV59.y, PV59.y z: ASHR R8.z, PV59.z, (0x00000002, 2.802596929e-45f).y w: MULADD_e ____, PV59.y, (0x3D2AAAAB, 0.04166666791f).w, (0x3E2AAAAB, 0.1666666716f).z t: AND_INT T2.z, PV59.w, (0x00000003, 4.203895393e-45f).x 61 x: ADD_INT R7.x, -1, PV60.x y: ADD_INT R4.y, (0xFFFFFFFE, -1.#QNANf).x, PV60.x z: MULADD_e ____, PV60.w, T0.y, 0.5 w: ADD_INT R2.w, (0xFFFFFFFD, -1.#QNANf).y, PV60.x t: ASHR R5.y, T1.w, (0x00000002, 2.802596929e-45f).z 62 x: ADD_INT R6.x, -1, T2.z y: ADD_INT R2.y, (0xFFFFFFFD, -1.#QNANf).x, T2.z z: ADD_INT R1.z, (0xFFFFFFFE, -1.#QNANf).y, T2.z w: MULADD_e R4.w, PV61.z, T1.y, T0.y t: AND_INT R1.w, R53.x, 1 12 VTX: ADDR(524) CNT(2) 63 VFETCH R8, R8.z, fc130 MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 64 VFETCH R5, R5.y, fc130 MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 13 ALU: ADDR(249) CNT(94) 65 x: SETE_INT T1.x, R50.x, 0.0f VEC_201 y: SETGT_UINT T3.y, R3.y, (0x7F800000, 1.#INFf).x z: CNDE_INT ____, R7.x, R8.y, R8.x VEC_021 t: SETGT_UINT R46.x, R50.x, (0x7F800000, 1.#INFf).x 66 x: AND_INT ____, PV65.x, PV65.y y: CNDE_INT ____, R4.y, R8.z, PV65.z z: AND_INT T2.z, PS65, 1 w: CNDE_INT ____, R6.x, R5.y, R5.x VEC_021 t: SETE_INT R59.x, R49.x, R3.y 67 x: AND_INT R42.x, PV66.x, 1 y: AND_INT T0.y, T1.x, PS66 z: CNDE_INT ____, R1.z, R5.z, PV66.w w: CNDE_INT ____, R2.w, R8.w, PV66.y t: AND_INT R58.x, T1.x, 1 68 x: CNDE_INT R4.x, R2.y, R5.w, PV67.z y: SETE_INT T1.y, R51.x, (0x3F800000, 1.0f).x z: MULADD_e ____, PV67.w, R4.w, PV67.w VEC_102 w: SETNE_INT T0.w, R51.x, R50.x t: OR_INT R41.x, R1.w, PV67.x 69 x: MULADD_e R3.x, PV68.x, R4.w, PV68.z y: AND_INT T0.y, T3.y, PV68.y z: SETE_INT T1.z, R3.y, 0.0f VEC_120 w: SETE_INT T1.w, R3.y, (0x7F800000, 1.#INFf).x VEC_120 t: AND_INT R54.x, T0.y, 1 70 x: OR_INT R40.x, R41.x, PS69 y: AND_INT T2.y, R6.w, PV69.z z: SETGT_UINT T0.z, R3.y, (0x4F7FFFFF, 4294967040f).x w: SETE_INT T2.w, R49.x, (0x3F800000, 1.0f).y VEC_120 t: AND_INT R31.x, T0.w, 1 71 x: OR_INT R39.x, PV70.x, R58.x y: AND_INT T0.y, R59.x, PV70.z z: SETE_INT T3.z, R50.x, (0x3F800000, 1.0f).x VEC_201 w: SETGT_UINT T3.w, (0x3F800000, 1.0f).x, R50.x VEC_021 t: AND_INT R30.x, T0.y, 1 72 x: OR_INT R13.x, PV71.x, T2.z y: SETE_INT T1.y, R51.x, R50.x z: AND_INT T2.z, PV71.y, PV71.w w: XOR_INT ____, -1, R59.x VEC_021 t: AND_INT R29.x, T1.y, 1 73 x: OR_INT R12.x, PV72.x, R31.x y: SETGT_UINT R3.y, (0x2E800000, 5.820766091e-11f).x, R3.y z: AND_INT T0.z, T0.z, PV72.w w: AND_INT T1.w, R6.w, PV72.y t: AND_INT R28.x, PS72, T1.w 74 x: OR_INT R11.x, PV73.x, R30.x y: SETGT_DX10 T2.y, (0xC2CE8ED0, -103.2789307f).x, R6.z z: AND_INT ____, R61.x, T0.w t: AND_INT R27.x, T2.y, 1 75 x: OR_INT R10.x, PV74.x, R28.x y: AND_INT R2.y, PV74.z, 1 t: AND_INT R38.x, T1.z, 1 76 x: OR_INT R26.x, PV75.x, R27.x t: AND_INT R25.x, T3.y, 1 77 x: OR_INT R24.x, R29.x, PV76.x t: AND_INT R23.x, T2.w, 1 78 x: OR_INT R22.x, PV77.x, R38.x t: AND_INT R21.x, T2.z, 1 79 x: OR_INT R20.x, PV78.x, R25.x t: AND_INT R37.x, T0.y, 1 80 x: OR_INT R19.x, PV79.x, R23.x t: AND_INT R18.x, PS79, T3.z 81 x: OR_INT R17.x, PV80.x, R21.x t: AND_INT R43.x, T0.z, 1 82 x: OR_INT R16.x, PV81.x, R18.x t: AND_INT R36.x, T3.w, PS81 83 x: OR_INT R15.x, R37.x, PV82.x t: AND_INT R35.x, T3.z, R43.x 84 x: OR_INT R14.x, PV83.x, R36.x t: AND_INT R48.x, T1.w, 1 85 x: OR_INT R34.x, PV84.x, R35.x t: AND_INT R47.x, R6.w, 1 86 x: OR_INT R33.x, R43.x, PV85.x t: AND_INT R52.x, R58.x, T1.y 87 x: OR_INT R32.x, PV86.x, R48.x t: AND_INT R57.x, R3.y, 1 88 x: OR_INT R2.x, R60.x, PV87.x t: AND_INT R56.x, T2.y, 1 89 x: OR_INT R7.x, PV88.x, R47.x t: CNDE_INT R55.x, R2.y, 0.0f, (0x80000000, -0.0f).x 90 x: OR_INT R6.x, PV89.x, R52.x 91 x: OR_INT R5.x, R54.x, PV90.x 92 x: OR_INT R9.x, R58.x, PV91.x 93 x: OR_INT R8.x, PV92.x, R57.x 94 x: OR_INT R1.x, R44.x, PV93.x 95 w: OR_INT ____, PV94.x, R56.x 96 x: AND_INT R0.x, PV95.w, 1 14 MEM_SCRATCH_WRITE_ACK: VEC_PTR[1].x___, R0, ARRAY_SIZE(1) ELEM_SIZE(3) MARK 15 WAIT_ACK: Outstanding_acks <= 0 16 TEX: ADDR(528) CNT(1) 97 RD_SCRATCH R0.x___, VEC_PTR[1], ARRAY_SIZE(1) ELEM_SIZE(3) UNCACHED 17 ALU_PUSH_BEFORE: ADDR(343) CNT(1) 98 x: PREDE_INT ____, R0.x, 0.0f UPDATE_EXEC_MASK UPDATE_PRED 18 JUMP ADDR(20) 19 ALU: ADDR(344) CNT(16) 99 x: LSHL ____, R45.x, (0x00000011, 2.382207389e-44f).x z: ASHR ____, R45.x, (0x00000006, 8.407790786e-45f).y w: ADD T0.w, R4.x, R3.x VEC_120 100 x: SETGT_INT T0.x, (0xFFFFFF83, -1.#QNANf).x, PV99.z y: ADD_INT ____, PV99.z, (0x00000015, 2.942726775e-44f).y w: AND_INT ____, PV99.x, (0xFF800000, -1.#INFf).z 101 x: AND_INT ____, PV100.y, (0x0000001F, 4.344025239e-44f).x z: ADD_INT T0.z, T0.w, PV100.w 102 w: LSHL ____, 1, PV101.x 103 z: MUL_e ____, T0.w, PV102.w 104 y: CNDE_INT ____, T0.x, T0.z, PV103.z 105 x: OR_INT R1.x, R55.x, PV104.y 20 ELSE POP_CNT(1) ADDR(23) 21 ALU: ADDR(360) CNT(122) 106 x: AND_INT R1.x, R1.x, 1 y: AND_INT R0.y, R44.x, 1 VEC_120 w: AND_INT T1.w, R2.x, 1 VEC_201 107 x: AND_INT R0.x, R5.x, 1 y: AND_INT T2.y, R6.x, 1 VEC_120 z: AND_INT T2.z, R7.x, 1 VEC_201 108 y: AND_INT ____, R13.x, 1 z: AND_INT R0.z, R8.x, 1 VEC_120 w: AND_INT T3.w, R9.x, 1 VEC_201 109 x: AND_INT ____, R12.x, 1 y: XOR_INT ____, -1, PV108.y z: AND_INT T0.z, R10.x, 1 VEC_120 w: AND_INT ____, R11.x, 1 VEC_201 110 x: AND_INT T0.x, R24.x, 1 y: AND_INT ____, R26.x, 1 VEC_120 z: XOR_INT ____, -1, PV109.x w: AND_INT T2.w, R31.x, PV109.y VEC_201 t: XOR_INT ____, -1, PV109.w 111 x: AND_INT T3.x, R30.x, PV110.z y: AND_INT T3.y, R28.x, PS110 VEC_120 z: XOR_INT ____, -1, T0.z w: AND_INT ____, R22.x, 1 VEC_201 t: XOR_INT ____, -1, PV110.y 112 x: AND_INT ____, R20.x, 1 y: XOR_INT ____, -1, PV111.w z: AND_INT T3.z, R27.x, PV111.z VEC_120 w: AND_INT R0.w, R29.x, PS111 VEC_201 113 x: AND_INT ____, R19.x, 1 y: AND_INT R1.y, R25.x, PV112.y VEC_120 z: AND_INT ____, R17.x, 1 VEC_201 w: XOR_INT ____, -1, PV112.x 114 x: AND_INT ____, R15.x, 1 y: XOR_INT ____, -1, PV113.x z: AND_INT R1.z, R23.x, PV113.w VEC_120 w: AND_INT ____, R16.x, 1 VEC_201 t: XOR_INT ____, -1, PV113.z 115 x: AND_INT R3.x, R18.x, PS114 y: AND_INT ____, R14.x, 1 VEC_120 z: XOR_INT ____, -1, PV114.w w: AND_INT R1.w, R21.x, PV114.y VEC_201 t: XOR_INT ____, -1, PV114.x 116 x: AND_INT T1.x, R34.x, 1 y: AND_INT R2.y, R37.x, PV115.z VEC_120 z: AND_INT R2.z, R36.x, PS115 VEC_201 w: XOR_INT ____, -1, PV115.y 117 x: AND_INT T2.x, R33.x, 1 z: AND_INT T0.z, R32.x, 1 VEC_120 w: AND_INT R2.w, R35.x, PV116.w VEC_201 118 y: AND_INT ____, R42.x, (0x000000FF, 3.573311084e-43f).x z: AND_INT ____, R40.x, (0x000000FF, 3.573311084e-43f).x VEC_120 w: AND_INT ____, R41.x, (0x000000FF, 3.573311084e-43f).x VEC_201 119 x: XOR_INT ____, -1, PV118.y y: CNDE_INT T0.y, PV118.y, 0.0f, R49.x VEC_021 z: XOR_INT ____, -1, PV118.w w: XOR_INT ____, -1, PV118.z t: AND_INT ____, R39.x, (0x000000FF, 3.573311084e-43f).x 120 x: OR_INT T0.x, (0xFFFFFFFE, -1.#QNANf).x, PV119.w y: OR_INT ____, (0xFFFFFFFE, -1.#QNANf).x, PV119.z z: XOR_INT ____, -1, PS119 w: AND_INT ____, R53.x, PV119.x t: XOR_INT ____, -1, T0.x 121 x: AND_INT R2.x, R38.x, PS120 y: AND_INT T1.y, R54.x, PV120.y VEC_120 z: AND_INT ____, PV120.w, 1 w: AND_INT ____, R46.x, PV120.z VEC_201 122 x: CNDE_INT ____, PV121.z, T0.y, (0x7FC00000, 1.#QNANf).x y: XOR_INT ____, -1, T1.x z: AND_INT T1.z, PV121.w, 1 w: AND_INT T0.w, R58.x, T0.x VEC_021 123 x: AND_INT R4.x, R43.x, PV122.y y: CNDE_INT R3.y, R59.x, (0x80000000, -0.0f).x, R51.x VEC_102 z: CNDE_INT R3.z, R59.x, 0.0f, R51.x VEC_102 w: CNDE_INT ____, T1.y, PV122.x, 0.0f t: XOR_INT ____, -1, T0.z 124 x: AND_INT R6.x, R60.x, PS123 y: XOR_INT ____, -1, T2.x z: CNDE_INT ____, T0.w, PV123.w, (0x7F800000, 1.#INFf).x w: XOR_INT ____, -1, T1.w 125 x: AND_INT R5.x, R48.x, PV124.y y: CNDE_INT ____, T1.z, PV124.z, R51.x z: AND_INT R4.z, R47.x, PV124.w VEC_120 w: XOR_INT ____, -1, T2.z 126 x: AND_INT R7.x, R52.x, PV125.w y: CNDE_INT R4.y, R59.x, 0.0f, R50.x VEC_102 z: CNDE_INT ____, T2.w, PV125.y, (0x7FC00000, 1.#QNANf).x w: XOR_INT ____, -1, T2.y 127 x: AND_INT R8.x, R54.x, PV126.w w: CNDE_INT ____, T3.x, PV126.z, R49.x VEC_102 128 x: CNDE_INT ____, T3.y, PV127.w, (0x7FC00000, 1.#QNANf).x y: AND_INT ____, R61.x, 1 z: CNDE_INT R5.z, R59.x, (0x7F800000, 1.#INFf).y, R51.x VEC_102 w: XOR_INT ____, -1, T3.w 129 x: AND_INT R0.x, R57.x, PV128.w y: CNDE_INT ____, T3.z, PV128.x, (0x7FC00000, 1.#QNANf).x z: CNDE_INT R6.z, PV128.y, (0x80000000, -0.0f).y, R51.x w: CNDE_INT R3.w, PV128.y, (0x7F800000, 1.#INFf).w, (0xFF800000, -1.#INFf).z t: XOR_INT ____, -1, R0.x 130 x: OR_INT R1.x, R55.x, (0x7F800000, 1.#INFf).x y: XOR_INT ____, -1, R0.z z: CNDE_INT ____, R0.w, PV129.y, (0x3F800000, 1.0f).y w: AND_INT R0.w, R58.x, PS129 VEC_201 t: XOR_INT ____, -1, R1.x 131 x: AND_INT R2.x, R0.y, PV130.y y: AND_INT R0.y, R56.x, PS130 w: CNDE_INT ____, R2.x, PV130.z, (0x3F800000, 1.0f).x VEC_120 132 x: CNDE_INT ____, R1.y, PV131.w, R49.x 133 y: CNDE_INT ____, R1.z, PV132.x, R51.x 134 z: CNDE_INT ____, R1.w, PV133.y, 0.0f 135 w: CNDE_INT R1.w, R3.x, PV134.z, (0x3F800000, 1.0f).x 22 ALU_POP_AFTER: ADDR(482) CNT(17) 136 x: CNDE_INT ____, R2.y, R1.w, (0x7F800000, 1.#INFf).x 137 y: CNDE_INT ____, R2.z, PV136.x, (0x7F800000, 1.#INFf).x 138 z: CNDE_INT ____, R2.w, PV137.y, (0x3F800000, 1.0f).x 139 w: CNDE_INT ____, R4.x, PV138.z, 0.0f 140 w: CNDE_INT ____, R5.x, PV139.w, R3.z 141 w: CNDE_INT ____, R6.x, PV140.w, R3.y 142 w: CNDE_INT ____, R4.z, PV141.w, R4.y 143 w: CNDE_INT ____, R7.x, PV142.w, R5.z 144 z: CNDE_INT ____, R8.x, PV143.w, R6.z 145 z: CNDE_INT ____, R0.w, PV144.z, R3.w 146 w: CNDE_INT ____, R0.x, PV145.z, (0x3F800000, 1.0f).x 147 w: CNDE_INT ____, R2.x, PV146.w, R1.x 148 x: CNDE_INT R1.x, R0.y, PV147.w, R55.x 23 TEX: ADDR(530) CNT(1) 149 RD_SCRATCH R0.x___, VEC_PTR[0], ARRAY_SIZE(1) ELEM_SIZE(3) UNCACHED 24 ALU: ADDR(499) CNT(3) KCACHE0(CB1:0-15) 150 y: ADD_INT ____, KC0[1].x, R0.x 151 x: LSHR R0.x, PV150.y, (0x00000002, 2.802596929e-45f).x 25 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R0].x___, R1, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM

0 Likes

So it appears we are getting similar number of instruction clauses, but variation in GPR usage. I checked with the internal SDK and get values similar to what i reported before.

What CAL version does it show in your SKA?

0 Likes

Originally posted by: himanshu.gautam So it appears we are getting similar number of instruction clauses, but variation in GPR usage. I checked with the internal SDK and get values similar to what i reported before.

 

What CAL version does it show in your SKA?

 

"Use Latest Available (CAL 10.12) - v1.56.2677"


Name,GPR,Scratch Reg,Min,Max,Avg,ALU,Fetch,Write,Est Cycles,ALU:Fetch,BottleNeck,%s\Clock,Throughput
Radeon HD 5450,62,2,49.50,67.50,58.50,142,10,3,58.50,29.25,ALU Ops,0.07,44 M Threads\Sec
Radeon HD 4550,11,0,96.00,96.00,96.00,192,2,1,96.00,48.00,ALU Ops,0.08,50 M Threads\Sec
Radeon HD 4670,11,0,24.00,24.00,24.00,192,2,1,24.00,24.00,ALU Ops,0.33,250 M Threads\Sec
Radeon HD 5670,62,2,19.80,27.00,23.40,142,10,3,23.40,14.63,ALU Ops,0.34,265 M Threads\Sec
Radeon HD 4770,11,0,24.00,24.00,24.00,192,2,1,24.00,24.00,ALU Ops,0.67,500 M Threads\Sec
Radeon HD 4890,11,0,19.20,19.20,19.20,192,2,1,19.20,24.00,ALU Ops,0.83,708 M Threads\Sec
Radeon HD 4870,11,0,19.20,19.20,19.20,192,2,1,19.20,24.00,ALU Ops,0.83,625 M Threads\Sec
FireStream 9250,11,0,19.20,19.20,19.20,192,2,1,19.20,24.00,ALU Ops,0.83,521 M Threads\Sec
FireStream 9270,11,0,19.20,19.20,19.20,192,2,1,19.20,24.00,ALU Ops,0.83,625 M Threads\Sec
Radeon HD 5770,62,2,9.90,13.50,11.70,142,10,3,11.70,7.31,ALU Ops,1.37,1162 M Threads\Sec
Radeon HD 6870,62,2,14.14,19.29,16.71,142,10,3,16.71,7.31,ALU Ops,1.91,1723 M Threads\Sec
Radeon HD 5870,62,2,9.90,13.50,11.70,142,10,3,11.70,7.31,ALU Ops,2.74,2325 M Threads\Sec
Radeon HD 6970,63,0,10.33,12.50,11.42,157,8,1,11.42,4.28,ALU Ops,2.80,2523 M Threads\Sec

0 Likes

eugnek,

I get reasonable GPR allocation with internal implementations. So maybe you are doing something wrong.

Are you sure you are using SDK 2.3? Please post the clInfo output. 

0 Likes

I'm 100% sure that I'm using SDK 2.3 (downloaded from amd.com around end of January), and I don't see how I could be doing anything wrong, seeing how all I do is paste that code into Stream Kernel Analyzer. If I ask SKA to check for updates, it tells me that the my version is up to date. If you download the SDK from the official web site, you'll see the same thing. Maybe it was fixed since the last release.

Why would clInfo matter, isn't SKA platform agnostic?

0 Likes

Actually both are right from SKA 4000 series 11 GPRs, 5000 series 62 GPRs and very strange ISA!!!

0 Likes

Hi all,

Does the profiler also return similar numbers when the kernel is being actually used inside a program?

0 Likes

Originally posted by: himanshu.gautam Hi all,

 

Does the profiler also return similar numbers when the kernel is being actually used inside a program?

 


I can't say. I only have VS 2005, and the profiler does not work with VS 2005.

0 Likes

Originally posted by: himanshu.gautam Hi all,

Does the profiler also return similar numbers when the kernel is being actually used inside a program?



Are you testing with SKA 1.7?

0 Likes

aymankh,

Can you confirm that you get similar values from Profiler also.

Are you using the same code mention in the thread before? and what SKA?

0 Likes