2 Replies Latest reply on Aug 19, 2010 11:58 PM by fruitfly1026

    About acmlgpu-1-1-1

      A question about libCALBLAS sample source in acmlgpu1-1-1


         I looked at GEMM_Shaders.h in acmlgpu1-1, which is used to build the libCALBLAS library in the "./src/libCALBLAS" subdirectory.

         The "szDGEMM_Mult" kernel has 8 inputs (4 for A and 4 for B) and 8 outputs (8 'o' registers for C), but why does the declaration part declare only 4 'o' registers, and why does it still compile and run without problems? Besides, when I change it to declare all 8 outputs, it also compiles and runs correctly. But when I change the kernel to "il_cs_2_0", the compilation fails.

        I'm confused now. Thank you for your reply.

        • About acmlgpu-1-1-1

          Hello Fruitfly,

          First, you are right that all of the output O-registers should have been declared, but the IL compiler is very forgiving of this detail, so the kernel compiles correctly with the incomplete declarations.   Since this has been working OK, we never noticed the omission.
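          As an illustration of the point above, here is a minimal sketch of a textual check that lists output registers a kernel string references but never declares. The helper names (`declared_outputs`, `referenced_outputs`, `undeclared_outputs`) are hypothetical and are not part of ACML-GPU or CAL. The scan is purely textual and does not parse IL, so treat its result as a hint only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TRACKED_OUTPUTS 32  /* o0..o31 fit in one unsigned long bitmask */

/* Set bit n in *mask if p points at "o<n>" with n < MAX_TRACKED_OUTPUTS. */
static void mark_output(const char *p, unsigned long *mask)
{
    if (p[0] == 'o' && isdigit((unsigned char)p[1])) {
        unsigned long n = strtoul(p + 1, NULL, 10);
        if (n < MAX_TRACKED_OUTPUTS)
            *mask |= 1UL << n;
    }
}

/* Bitmask of registers named in "dcl_output_generic oN" lines. */
static unsigned long declared_outputs(const char *il)
{
    static const char key[] = "dcl_output_generic";
    unsigned long mask = 0;
    const char *p;

    assert(il != NULL);
    for (p = strstr(il, key); p != NULL; p = strstr(p, key)) {
        p += sizeof key - 1;              /* skip the keyword itself */
        while (*p == ' ' || *p == '\t')
            p++;
        mark_output(p, &mask);
    }
    return mask;
}

/* Bitmask of every "oN" token that appears anywhere in the kernel text. */
static unsigned long referenced_outputs(const char *il)
{
    unsigned long mask = 0;
    const char *p;

    assert(il != NULL);
    for (p = il; *p != '\0'; p++) {
        /* Only accept 'o' at the start of a token, so the 'o' inside
           words such as "mov" or "output" is ignored. */
        int token_start = (p == il) ||
            !(isalnum((unsigned char)p[-1]) || p[-1] == '_');
        if (token_start)
            mark_output(p, &mask);
    }
    return mask;
}

/* Registers that are referenced but have no dcl_output_generic line. */
static unsigned long undeclared_outputs(const char *il)
{
    return referenced_outputs(il) & ~declared_outputs(il);
}
```

          Passing a kernel string such as szDGEMM_Mult to `undeclared_outputs` would return a bitmask in which bit N is set when oN is written without a matching declaration.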

          If you just change a kernel type declaration from il_ps_xxx to il_cs_xxx and don't go through the kernel and ensure that you are only using features and registers that are available in a compute shader, the kernel will fail to compile.  I just tried to repeat that here to see what would happen, and the CAL IL compiler threw an exception in calclCompile().  Since that is clearly a bug, I will file a bug report to get that fixed.

          But to answer your question, I think the problem is the "dcl_input_position" declaration.  This input is not available to compute shaders, and is probably what's causing the compile to fail.  (There may be other statements in this shader that are not legal in a compute shader that I've missed.)
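          To make that check concrete, the following is a small sketch that flags a kernel string which declares itself a compute shader but still contains a "dcl_input_position" declaration. The function names are hypothetical, the test is a plain substring search rather than an IL parser, and it covers only the one declaration suspected above, so a clean result does not mean the kernel is legal as a compute shader.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the first token of the kernel text starts with "il_cs_". */
static int is_compute_kernel(const char *il)
{
    assert(il != NULL);
    while (*il == ' ' || *il == '\t' || *il == '\n')
        il++;                              /* skip leading whitespace */
    return strncmp(il, "il_cs_", 6) == 0;
}

/* Returns 1 if a compute kernel still carries a position input
   declaration, which this thread suspects is not available there. */
static int has_suspect_position_input(const char *il)
{
    return is_compute_kernel(il) &&
           strstr(il, "dcl_input_position") != NULL;
}
```

          Running this over each kernel string before calling calclCompile() gives a quick list of kernels worth reviewing by hand.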

          I hope that helps,

            • About acmlgpu-1-1-1

              Hello DieInSente,

                First, thank you for your reply.

                I changed four kernels in GEMM_Shaders.h into compute shaders: szSGEMM_Part4, szSGEMM_Part4T, szDGEMM_Part4T and szDGEMM_Mult. Except for the szDGEMM_Mult kernel, the other three compile successfully, even though I use "dcl_input_position_interp(linear_noperspective) v0.xy__   \n" in all four kernels.  Besides, in the szDGEMM_Mult kernel, when I comment out the last 16 lines, which are "dmul"s, the compilation succeeds. What is the problem?

                 I hope to hear from you.

              P.S. For the szDGEMM_Mult kernel, I only added the "dcl_max_thread_per_group 64 \n" line; the remaining lines are unchanged. The szSGEMM_Part4 kernel is below.

              static const char * szSGEMM_Part4 =
                  "il_cs_2_0 \n"
                  "dcl_input_position_interp(linear_noperspective) v0.xy__ \n"
                  "dcl_resource_id(0)_type(2d,unnorm)_fmtx(unknown)_fmty(unknown)_fmtz(unknown)_fmtw(unknown) \n"
                  "dcl_cb cb0[1] \n"
                  "dcl_output_generic o0 \n"
                  "dcl_output_generic o1 \n"
                  "dcl_output_generic o2 \n"
                  "dcl_output_generic o3 \n"
                  //declare threads number
                  "dcl_max_thread_per_group 64 \n"
                  "dcl_literal l0, 0x40800000, 0x00000000, 0x3f800000, 0x40000000 \n"
                  "add r0.xy__, v0.xyyy, cb0[0].xyzw_neg(xyzw) \n"
                  "mul r1._y__, r0.yyyy, l0.x \n"                    // 4Y
                  "mov r0.x_z_, r0.xxxx \n"                          // < X, ?, X, ? >
                  "mov r0._y_w, r1.yyyy \n"                          // < X, 4Y, X, 4Y >
                  "add r0._y_w, r0, l0.yyyz \n"                      // < X, 4Y, X, 4Y+1 >
                  "add r1, r0, l0.ywyw \n"                           // < X, 4Y+2, X, 4Y+3 >
                  "sample_resource(0)_sampler(0) r2, r0.xyyy \n"     // [ X ][ 4Y ]
                  "sample_resource(0)_sampler(0) r3, r0.zwww \n"     // [ X ][ 4Y+1 ]
                  "sample_resource(0)_sampler(0) r4, r1.xyyy \n"     // [ X ][ 4Y+2 ]
                  "sample_resource(0)_sampler(0) r5, r1.zwww \n"     // [ X ][ 4Y+3 ]
                  "mov o0, r2 \n"
                  "mov o1, r3 \n"
                  "mov o2, r4 \n"
                  "mov o3, r5 \n"
                  "end\n";