16 Replies Latest reply on Mar 18, 2009 3:23 PM by MicahVillmow

    Problems with indexed array and global buffer in IL

    the729

      Both scatter_IL and scratch_buffer_IL examples work fine. However, combining these two features together seems problematic.

      The testing IL kernel is:

      il_ps_2_0
      dcl_indexed_temp_array x0[2]
      dcl_input vObjIndex0
      mov x0[vObjIndex0.x], 1
      mov r0, x0[0]
      mov g[0], r0
      endmain

      I am using SKA1.1 and CAL 9.1 to compile it into RV770 assembly. It reads:

      ; --------  Disassembly --------------------
      00 ALU: ADDR(32) CNT(4) 
            0  x: MOV         R1.x,  R1.x      
               y: MOV         R1.y,  R1.y      
               z: MOV         R1.z,  R1.z      
               w: MOV         R1.w,  R1.w      
      01 ALU: ADDR(36) CNT(5) 
            1  x: MOVA_INT    ____,  R0.x      
            2  x: MOV         R4[A0.x].x,  R1.x      
               y: MOV         R4[A0.x].y,  R1.y      
               z: MOV         R4[A0.x].z,  R1.z      
               w: MOV         R4[A0.x].w,  R1.w      
      02 ALU: ADDR(41) CNT(13) 
            3  x: MOVA_INT    ____,  0.0f      
            4  x: MOV         R0.x,  R4[A0.x].x      
               y: MOV         R0.y,  R4[A0.x].y      
               z: MOV         R0.z,  R4[A0.x].z      
               w: MOV         R0.w,  R4[A0.x].w      
            5  x: MOV         R4.x,  R0.x      
               y: MOV         R4.y,  R0.y      
               z: MOV         R4.z,  R0.z      
               w: MOV         R4.w,  R0.w      
            6  x: MOV         R0.x,  0.0f      
               y: MOV         R0.y,  0.0f      
               z: MOV         R0.z,  0.0f      
               w: MOV         R0.w,  0.0f      
      03 EXP_DONE: PIX0, R0
      END_OF_PROGRAM

      ; -------- End of Disassembly --------------------


      It seems x0[] and g[] become identical (the write to g[0] turns into a plain MOV to R4), and the kernel contains no MEM_EXPORT_WRITE operation, so it never writes to the global buffer.

      However, changing all x0[] into x1[] (including declaration) in the IL kernel solves the problem. Now it reads:

      ; --------  Disassembly --------------------
      00 ALU: ADDR(32) CNT(4) 
            0  x: MOV         R1.x,  R1.x      
               y: MOV         R1.y,  R1.y      
               z: MOV         R1.z,  R1.z      
               w: MOV         R1.w,  R1.w      
      01 MEM_SCRATCH_WRITE_IND_ACK: VEC_PTR[0+R0.x], R1, ARRAY_SIZE(1) ELEM_SIZE(3) 
      02 WAIT_ACK:  Outstanding_acks <= 0 
      03 VTX: ADDR(48) CNT(1) 
            1  RD_SCRATCH R0, VEC_PTR[0], ARRAY_SIZE(1) ELEM_SIZE(3) UNCACHED BURST_CNT(0) 
      04 MEM_EXPORT_WRITE: DWORD_PTR[0], R0, ELEM_SIZE(3) 
      05 ALU: ADDR(36) CNT(4) 
            2  x: MOV         R0.x,  0.0f      
               y: MOV         R0.y,  0.0f      
               z: MOV         R0.z,  0.0f      
               w: MOV         R0.w,  0.0f      
      06 EXP_DONE: PIX0, R0
      END_OF_PROGRAM

      ; -------- End of Disassembly --------------------

       

      But in this version, the scratch buffer is used instead of indexed registers. And if I understand correctly, the scratch buffer is located in RAM and is much slower than registers.

      I am wondering whether it is a bug that x0 and x1 behave differently. At the very least, the disassembly of the first version is not what the IL kernel is supposed to compile to.
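A minimal sketch, in C, of the check done by eye on the two listings above: look for the clause names that show whether the indexed array went to registers or to the scratch buffer, and whether a global buffer write was emitted at all. The helper names (`indexed_array_path`, `writes_global_buffer`) are hypothetical and not part of CAL or SKA; the sketch only assumes the disassembly is available as plain text.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helpers, not part of CAL or SKA. They classify a
 * disassembly listing (plain text) by the clause names seen above. */
enum array_path {
    PATH_NONE,      /* no indexed access recognised */
    PATH_REGISTERS, /* MOVA_INT + R[A0.x] relative addressing */
    PATH_SCRATCH    /* MEM_SCRATCH_* / RD_SCRATCH clauses */
};

static enum array_path indexed_array_path(const char *disasm)
{
    assert(disasm != NULL);
    /* Scratch clauses take priority: they mean the array left the GPRs. */
    if (strstr(disasm, "MEM_SCRATCH_") || strstr(disasm, "RD_SCRATCH"))
        return PATH_SCRATCH;
    if (strstr(disasm, "MOVA_INT"))
        return PATH_REGISTERS;
    return PATH_NONE;
}

/* Returns 1 if the listing contains a global buffer write. */
static int writes_global_buffer(const char *disasm)
{
    assert(disasm != NULL);
    return strstr(disasm, "MEM_EXPORT_WRITE") != NULL;
}
```

For the first listing this reports the register path with no global buffer write, which is the symptom described; for the second it reports the scratch path with the write present.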

        • Problems with indexed array and global buffer in IL
          the729

          Besides, why are local arrays still not supported in SDK 1.4? Am I missing something?

          • Problems with indexed array and global buffer in IL
            MicahVillmow

            Is this bug in 1.4 or 1.3? If you are on 1.3, please upgrade to 1.4 and see if it still exists.

              • Problems with indexed array and global buffer in IL
                the729

                Hi Micah,

                The bug is found in the Stream KernelAnalyzer 1.1, which is shipped with its own IL compilation dll (ILAssembler.dll in the installation directory).

                I have not tried to compile the kernel with CAL APIs. I will try it later.

                 

                • Problems with indexed array and global buffer in IL
                  the729

                  I ran the test with the calcl APIs, and the results are the same as SKA's. I installed CAL 1.4 and driver 9.2; however, calclGetVersion() returns 1.3.186.

                    • Problems with indexed array and global buffer in IL
                      empty_knapsack

                      It may not be closely related to the OP, but still. I once tried to use an array in my IL assembly. Just inserting "dcl_indexed_temp_array x0[16]" produced completely incorrect results. Note that no other changes were made to the kernel, just this one inserted line. Moreover, commenting the line out with ';' before "dcl" still isn't enough -- the kernel keeps working wrong. Only removing the line, or changing it into something like ";dcl_indexed_temp_array_xx x0[16]", revives my kernel.

                      Very weird behaviour. Got it with SDK 1.3, haven't tried 1.4 yet. But if there are no real arrays in hardware, there's no point anyway.

                        • Problems with indexed array and global buffer in IL
                          the729

                           


                           

                          Did you check the disassembly code of your kernel?
                          Even if it works without "dcl", it may be using the scratch buffer which is slower than regs.

                            • Problems with indexed array and global buffer in IL
                              empty_knapsack

                              Well, in fact I've realized that my code doesn't run at all after I insert "dcl_indexed_temp_array" (commented out or not).

                               

                              I'm getting an error here:

                               if (calCtxRunProgram(&e, ctx, func, &domain) != CAL_RESULT_OK) {
                                printf("error in run [%s]\n", calGetErrorString());
                                return 1;
                               }

                              And the error text is also very descriptive -- [Symbol "]. That's it, just a single quote. Luckily I had already wasted several hours on this before, so I knew that [Symbol "] should be something like [Symbol "XX" is not defined in function "YY"]. And (as I haven't allocated/bound a name for the global buffer) a kernel using dcl_indexed_temp_array just fails to run.

                              I was just hoping that it's possible to use arrays in CAL IL, but as far as I understand there are no real arrays in IL, only emulation via the global buffer.

                              Still weird that commenting the line out means nothing to the compiler.

                        • Problems with indexed array and global buffer in IL
                          the729

                          to Micah:

                          As far as I understand, an indexed temp array works just like a local array, although there is no dedicated space for it on the hardware. It can be placed in the reg file if it is small, or otherwise in the scratch buffer (where is that on the hw?).

                          So can we look forward to support for indexed temp arrays in Brook+?

                          to empty_knapsack:

                          If the array is small enough to fit in the GPRs, it will not use the scratch buffer (I don't know if scratch buffer = global buffer).

                          If you do not declare the array, the compiler uses a default size of 4096, which is too large to be placed in the reg file. And as I described in the top post, arrays that use the scratch buffer do not affect the global buffer. The bug is only seen if the array is placed in the reg file, which, I guess, is the case when you insert "dcl". So I guess we are facing the same problem. :-)

                          However, for me, commenting with ";" works (CAL 1.3/1.4 and SKA). Please check whether you forgot a "\n", which must be inserted after every IL line.
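To illustrate the "\n" point: when the IL source is embedded in the host program as adjacent C string literals, the compiler concatenates them, so a missing "\n" silently joins two IL lines. The sketch below is hypothetical (a trimmed version of the kernel from the top post, plus a made-up `count_lines` helper) and assumes, as this thread does, that a ';' comment runs to the end of the line.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the newline-terminated lines in an IL
 * source string. */
static size_t count_lines(const char *src)
{
    size_t n = 0;
    assert(src != NULL);
    for (; *src != '\0'; ++src)
        if (*src == '\n')
            ++n;
    return n;
}

/* Every line ends in "\n": the ';' comment covers only the dcl line. */
static const char kernel_ok[] =
    "il_ps_2_0\n"
    ";dcl_indexed_temp_array x0[2]\n"
    "dcl_input vObjIndex0\n"
    "endmain\n";

/* "\n" missing after the comment: the two literals are concatenated into
 * one IL line, so the ';' comment swallows the dcl_input line as well. */
static const char kernel_bad[] =
    "il_ps_2_0\n"
    ";dcl_indexed_temp_array x0[2]"
    "dcl_input vObjIndex0\n"
    "endmain\n";
```

Comparing `count_lines()` for the two strings shows the second one is a line short, which is a cheap sanity check before handing the text to calclCompile. It does not apply when the kernel is loaded from a separate text file.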

                            • Problems with indexed array and global buffer in IL
                              empty_knapsack

                              the729,

                               

                              no, it's not a missing '\n' problem -- I'm loading the kernel from a separate text file, so there are no problems with newlines. I've run some tests and realized that any declaration of dcl_indexed_temp_array leads to a variable named "x[]" being declared inside the compiled image. (Grr, English isn't my native language and I have some difficulty explaining what I really mean :S).

                               

                              Example:

                              I have a kernel like:

                              il_ps_2_0
                              dcl_input_position_interp(linear_noperspective) vWinCoord0.xy__
                              dcl_output_generic o0
                              dcl_output_generic o1
                              dcl_output_generic o2
                              dcl_cb cb0[4]
                              dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                              dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                              dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                              dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                              dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                              sample_resource(0)_sampler(0) r0, vWinCoord0.xyxx
                              sample_resource(1)_sampler(0) r1, vWinCoord0.xyxx
                              sample_resource(2)_sampler(0) r2, vWinCoord0.xyxx
                              sample_resource(3)_sampler(0) r3, vWinCoord0.xyxx
                              sample_resource(4)_sampler(0) r4, vWinCoord0.xyxx
                              dcl_literal l1,0x7fffffff,0x7fffffff,0x7fffffff,0x7fffffff
                              dcl_literal l2,0x80000000,0x80000000,0x80000000,0x80000000
                              dcl_literal l3,0x80000001,0x80000001,0x80000001,0x80000001

                              iadd r10,r0,l1
                              iadd r11,r1,l2
                              iadd r12,r2,l3

                              mov o0,r10
                              mov o1,r11
                              mov o2,r12

                              end

                              I'm reading it from a text file, compiling, linking and saving it as an ELF image with:

                               if (calclCompile(&obj, lang, pText, info.target) != CAL_RESULT_OK) {
                                fprintf(stdout, "Kernel compilation failed. Exiting.\n");
                                return 1;
                               }
                               if (calclLink(&image, &obj, 1) != CAL_RESULT_OK) {
                                fprintf(stdout, "Kernel linking failed. Exiting.\n");
                                return 1;
                               }
                               {
                                /* Dump the linked image to image.bin so it can be inspected. */
                                CALint isize;
                                BYTE *px;
                                FILE *f;

                                calclImageGetSize(&isize, image);
                                px = (BYTE *)malloc(isize);
                                if (px == NULL) {
                                 fprintf(stdout, "Out of memory. Exiting.\n");
                                 return 1;
                                }
                                calclImageWrite(px, isize, image);

                                f = fopen("image.bin", "wb");
                                if (f == NULL) {
                                 fprintf(stdout, "Cannot open image.bin. Exiting.\n");
                                 free(px);
                                 return 1;
                                }
                                fwrite(px, isize, 1, f);
                                fclose(f);
                                free(px);
                               }

                              Now, looking at the newly created image.bin, I can see at the very end the declarations of all inputs/outputs -- "i4 i3 i2 i1 i0 s0 o2 o1 o0 cb0". Now, if I add just one line to the kernel, "dcl_indexed_temp_array x0[2]", the inputs/outputs string in image.bin becomes "i4 i3 i2 i1 i0 s0 o2 o1 o0 x[] cb0".

                               

                              And any use of "dcl_indexed_temp_array" leads to the declaration of "x[]". Even if it's commented out, "x[]" still appears. Even if there are no references to the declared array, "x[]" is still there.
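The same check can be done without opening image.bin by hand. The following is only a rough sketch: `image_contains` is a hypothetical helper (not a CAL call) that does a raw byte search over the buffer that calclImageWrite filled, so it can give a false positive if the same bytes happen to occur elsewhere in the image.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the bytes of `name` (without the
 * terminating NUL) occur anywhere in the first `size` bytes of `image`. */
static int image_contains(const unsigned char *image, size_t size,
                          const char *name)
{
    size_t len, i;

    assert(image != NULL && name != NULL);
    len = strlen(name);
    if (len == 0 || len > size)
        return 0;
    /* Naive scan; images are small, so this is fast enough. */
    for (i = 0; i + len <= size; ++i)
        if (memcmp(image + i, name, len) == 0)
            return 1;
    return 0;
}
```

With the buffer from the snippet above, `image_contains(px, isize, "x[]")` would tell the host code whether a global buffer has to be bound before calCtxRunProgram is called.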

                               

                              And (as all inputs/outputs need to be bound before calling calCtxRunProgram) my program fails to run, because there is no global buffer, so nothing is allocated/bound.

                              Adding something like:

                               CALresource localRes;

                               if (calResAllocRemote2D(&localRes, &device, 1, DIM_X, DIM_Y, CAL_FORMAT_UINT_4, CAL_RESALLOC_GLOBAL_BUFFER) != CAL_RESULT_OK) {
                                printf("Error [%s]\n", calGetErrorString());
                               }
                               CALmem localMem;
                               
                               calCtxGetMem(&localMem, ctx, localRes);
                               CALname localName;
                               if (calModuleGetName(&localName, ctx, module, "x[]") != CAL_RESULT_OK) {
                                printf("Error in getname [%s]\n", calGetErrorString());
                               }
                               calCtxSetMem(ctx, localName, localMem);

                              This solves the problem with running, as "x[]" is now allocated/bound. But in fact it isn't used at all inside the kernel.

                              So I really doubt it's possible to declare any local array inside an IL kernel, even a very small one.

                               

                              // Hope you got what I'm trying to explain here

                                • Problems with indexed array and global buffer in IL
                                  the729

                                  Yeah, I just repeated your test and got the same result as you.

                                  Moreover, changing all x0[] in your IL kernel into x1[] leads to a completely different result: I got x[] in image.bin when "dcl" is present and not commented out, and a segmentation fault if "dcl" is commented out or not present.

                                    • Problems with indexed array and global buffer in IL
                                      MicahVillmow

                                      empty_knapsack/the729,

                                       The indexed arrays, or scratch buffers as they are called in hardware, are stored in main memory but are not emulated in the global buffer. If the compiler can determine that your indexed array accesses fit in registers, it compiles them to registers, either through static addressing via register copies or through dynamic addressing using the ar register. The scratch buffer is mainly used for register spilling, as required by the DX spec, but was exposed in CAL as a method for thread-local storage.

                                      • Problems with indexed array and global buffer in IL
                                        MicahVillmow

                                        This seems like a valid bug, do either of you have a simple test case that you can email to streamdeveloper@amd.com attn: Micah Villmow so that I can work on getting it fixed?

                                          • Problems with indexed array and global buffer in IL
                                            empty_knapsack

                                            I've just sent an email to streamdeveloper@amd.com describing this and another calcl bug, though without any mention of your name.

                                            But I'm a bit puzzled that you always require an email when all the necessary information is already here on the forum. Does it mean that only emails matter while the forum is just "for fun"? Some weird bureaucracy.

                                              • Problems with indexed array and global buffer in IL
                                                MicahVillmow

                                                Thanks for the test case, looking at the issue.

                                                No, the email was for the test case, as that provides me with the exact code base that you see causing the problem. In most cases it is not feasible to put the full test case on the forums, and there might be differences in how we write the test case that could cause divergent results. This just removes variables from the test.

                                                  • Problems with indexed array and global buffer in IL
                                                    MicahVillmow

                                                    Ok,

                                                     So this seems to be fixed in either 9.4 or 9.5, not sure when my internal compiler version will make it public. 

                                                     

                                                    il1.il:

                                                    il_tester.exe -f il1.il -a

                                                    Program:                  il_tester.exe         Kernel  System
                                                     WxH            In-Out   Src     Dst     Iter   GB/sec  GB/sec
                                                    File: il1.il - ShaderType = 1
                                                    TargetChip = w
                                                    ;SC Dep components
                                                    NumClauseTemps = 4

                                                    ; --------  Disassembly --------------------
                                                    00 ALU: ADDR(32) CNT(5)
                                                          0  x: MOV         R1.x,  0.0f
                                                             y: MOV         R1.y,  0.0f
                                                             z: MOV         R1.z,  0.0f
                                                             w: MOV         R1.w,  0.0f
                                                             t: MOV         R2.x,  R0.y
                                                    01 ALU: ADDR(37) CNT(5)
                                                          1  x: MOVA_INT    ____,  R0.x
                                                          2  x: MOV         R5[A0.x].x,  R1.x
                                                             y: MOV         R5[A0.x].y,  R1.y
                                                             z: MOV         R5[A0.x].z,  R1.z
                                                             w: MOV         R5[A0.x].w,  R1.w
                                                    02 ALU: ADDR(42) CNT(5)
                                                          3  x: MOVA_INT    ____,  R2.x
                                                          4  x: MOV         R2.x,  R5[A0.x].x
                                                             y: MOV         R2.y,  R5[A0.x].y
                                                             z: MOV         R2.z,  R5[A0.x].z
                                                             w: MOV         R2.w,  R5[A0.x].w
                                                    03 MEM_EXPORT_WRITE: DWORD_PTR[0], R2, ELEM_SIZE(3)  VPM
                                                    04 ALU: ADDR(47) CNT(4)
                                                          5  x: MOV         R2.x,  0.0f
                                                             y: MOV         R2.y,  0.0f
                                                             z: MOV         R2.z,  0.0f
                                                             w: MOV         R2.w,  0.0f
                                                    05 EXP_DONE: PIX0, R2
                                                    END_OF_PROGRAM

                                                     

                                                    il2.il:

                                                    il_tester.exe -f il2.il -a

                                                    Program:                  il_tester.exe         Kernel  System
                                                     WxH            In-Out   Src     Dst     Iter   GB/sec  GB/sec
                                                    File: il2.il - ShaderType = 1
                                                    TargetChip = w
                                                    ;SC Dep components
                                                    NumClauseTemps = 4

                                                    ; --------  Disassembly --------------------
                                                    00 ALU: ADDR(32) CNT(5)
                                                          0  x: MOV         R1.x,  0.0f
                                                             y: MOV         R1.y,  0.0f
                                                             z: MOV         R1.z,  0.0f
                                                             w: MOV         R1.w,  0.0f
                                                             t: MOV         R2.x,  R0.y
                                                    01 MEM_SCRATCH_WRITE_IND_ACK: VEC_PTR[0+R0.x], R1, ARRAY_SIZE(1) ELEM_SIZE(3)
                                                    02 WAIT_ACK:  Outstanding_acks <= 0
                                                    03 TEX: ADDR(48) CNT(1) VALID_PIX
                                                          1  RD_SCRATCH R2, VEC_PTR[0+R2.x], ARRAY_SIZE(1) ELEM_SIZE(3) UNCACHED
                                                    04 MEM_EXPORT_WRITE: DWORD_PTR[0], R2, ELEM_SIZE(3)  VPM
                                                    05 ALU: ADDR(37) CNT(4)
                                                          2  x: MOV         R2.x,  0.0f
                                                             y: MOV         R2.y,  0.0f
                                                             z: MOV         R2.z,  0.0f
                                                             w: MOV         R2.w,  0.0f
                                                    06 EXP_DONE: PIX0, R2
                                                    END_OF_PROGRAM

                                                     

                                                    Both look correct, but it seems that the scratch registers are not getting optimized away in the second example, which I've reported to the compiler team.