
the729
Journeyman III

Problems with indexed array and global buffer in IL

Both scatter_IL and scratch_buffer_IL examples work fine. However, combining these two features together seems problematic.

The testing IL kernel is:

il_ps_2_0
dcl_indexed_temp_array x0[2]
dcl_input vObjIndex0
mov x0[vObjIndex0.x], 1
mov r0, x0[0]
mov g[0], r0
endmain

I am using SKA1.1 and CAL 9.1 to compile it into RV770 assembly. It reads:

; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(4) 
      0  x: MOV         R1.x,  R1.x      
         y: MOV         R1.y,  R1.y      
         z: MOV         R1.z,  R1.z      
         w: MOV         R1.w,  R1.w      
01 ALU: ADDR(36) CNT(5) 
      1  x: MOVA_INT    ____,  R0.x      
      2  x: MOV         R4[A0.x].x,  R1.x      
         y: MOV         R4[A0.x].y,  R1.y      
         z: MOV         R4[A0.x].z,  R1.z      
         w: MOV         R4[A0.x].w,  R1.w      
02 ALU: ADDR(41) CNT(13) 
      3  x: MOVA_INT    ____,  0.0f      
      4  x: MOV         R0.x,  R4[A0.x].x      
         y: MOV         R0.y,  R4[A0.x].y      
         z: MOV         R0.z,  R4[A0.x].z      
         w: MOV         R0.w,  R4[A0.x].w      
      5  x: MOV         R4.x,  R0.x      
         y: MOV         R4.y,  R0.y      
         z: MOV         R4.z,  R0.z      
         w: MOV         R4.w,  R0.w      
      6  x: MOV         R0.x,  0.0f      
         y: MOV         R0.y,  0.0f      
         z: MOV         R0.z,  0.0f      
         w: MOV         R0.w,  0.0f      
03 EXP_DONE: PIX0, R0
END_OF_PROGRAM

; -------- End of Disassembly --------------------


It seems x0[] and g[] are treated as the same thing, and the kernel contains no MEM_EXPORT_WRITE instruction, so nothing is ever written to the global buffer.
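For comparison, here is a rough C model of what I expect the IL above to do. It is only a sketch: the function name `run_kernel_model`, the plain float arrays standing in for x0[] and g[], the zero-initialised array, and treating the literal 1 as 1.0f are all my assumptions, and a scalar stands in for the four-component registers.

```c
#include <stddef.h>

/* Hypothetical scalar model of the test kernel; this is not CAL API code.
 * x0 stands in for the 2-entry indexed temp array, g for the global buffer. */
void run_kernel_model(size_t obj_index, float *g)
{
    float x0[2] = {0.0f, 0.0f};   /* assumption: array starts zeroed */

    if (obj_index < 2)            /* bounds guard added for the model only */
        x0[obj_index] = 1.0f;     /* mov x0[vObjIndex0.x], 1 */

    float r0 = x0[0];             /* mov r0, x0[0] */
    g[0] = r0;                    /* mov g[0], r0: must reach memory */
}
```

Whatever the index is, g[0] has to be written, which is why a disassembly without any MEM_EXPORT_WRITE cannot be right.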

However, changing all x0[] into x1[] (including the declaration) in the IL kernel solves the problem. Now it reads:

; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(4) 
      0  x: MOV         R1.x,  R1.x      
         y: MOV         R1.y,  R1.y      
         z: MOV         R1.z,  R1.z      
         w: MOV         R1.w,  R1.w      
01 MEM_SCRATCH_WRITE_IND_ACK: VEC_PTR[0+R0.x], R1, ARRAY_SIZE(1) ELEM_SIZE(3) 
02 WAIT_ACK:  Outstanding_acks <= 0 
03 VTX: ADDR(48) CNT(1) 
      1  RD_SCRATCH R0, VEC_PTR[0], ARRAY_SIZE(1) ELEM_SIZE(3) UNCACHED BURST_CNT(0) 
04 MEM_EXPORT_WRITE: DWORD_PTR[0], R0, ELEM_SIZE(3) 
05 ALU: ADDR(36) CNT(4) 
      2  x: MOV         R0.x,  0.0f      
         y: MOV         R0.y,  0.0f      
         z: MOV         R0.z,  0.0f      
         w: MOV         R0.w,  0.0f      
06 EXP_DONE: PIX0, R0
END_OF_PROGRAM

; -------- End of Disassembly --------------------

 

But in this version, the scratch buffer is used instead of indexed registers. If I understand correctly, the scratch buffer is located in RAM and is much slower than registers.

I am wondering whether it is a bug that x0 and x1 behave differently. At the very least, the disassembly of the first version does not do what the IL kernel asks for.

0 Likes
16 Replies
the729
Journeyman III

Besides, why are local arrays still not supported in SDK 1.4? Am I missing something?

0 Likes

There are no fully generic local arrays in our hardware, but the closest thing we have right now is the local data share in compute shader mode. However, this is an owner-writes model, so it is not fully generic, which limits its usefulness to certain problem domains.

0 Likes

Is this bug in 1.4 or 1.3? If you are on 1.3, please upgrade to 1.4 and see if it still exists.

0 Likes

Hi Micah,

The bug is found in the Stream KernelAnalyzer 1.1, which is shipped with its own IL compilation dll (ILAssembler.dll in the installation directory).

I have not tried to compile the kernel with CAL APIs. I will try it later.

 

0 Likes

I did the test with the calcl APIs, and the results are the same as SKA's. I installed CAL 1.4 and driver 9.2; however, calclGetVersion() returns 1.3.186.

0 Likes

This may not be closely related to the OP, but still: I once tried to use an array in my IL assembly. Just inserting "dcl_indexed_temp_array x0[16]" produced completely incorrect results. Note that no other changes were made to the kernel, only this line was inserted. Moreover, commenting the line out with ';' before "dcl" still isn't enough: the kernel keeps working wrong. Only removing the line or changing it into something like ";dcl_indexed_temp_array_xx x0[16]" revives my kernel.

Very weird behaviour. I got it with SDK 1.3 and haven't tried 1.4 yet. But if there are no real arrays in hardware, there's no point anyway.

0 Likes

Originally posted by: empty_knapsack

This may not be closely related to the OP, but still: I once tried to use an array in my IL assembly. Just inserting "dcl_indexed_temp_array x0[16]" produced completely incorrect results. Note that no other changes were made to the kernel, only this line was inserted. Moreover, commenting the line out with ';' before "dcl" still isn't enough: the kernel keeps working wrong. Only removing the line or changing it into something like ";dcl_indexed_temp_array_xx x0[16]" revives my kernel.

Very weird behaviour. I got it with SDK 1.3 and haven't tried 1.4 yet. But if there are no real arrays in hardware, there's no point anyway.

Did you check the disassembly of your kernel?
Even if it works without "dcl", it may be using the scratch buffer, which is slower than registers.

0 Likes

Well, in fact I've realized that my code doesn't run at all once I insert "dcl_indexed_temp_array" (commented out or not).

I'm getting an error here:

 if (calCtxRunProgram(&e, ctx, func, &domain) != CAL_RESULT_OK) {
  printf("error in run [%s]\n", calGetErrorString());
  return 1;
 }

And the error text is also very descriptive: [Symbol "]. That's it, just a single quote. Luckily I had already wasted several hours earlier figuring out that [Symbol "] should be something like [Symbol "XX" is not defined in function "YY"]. And since I haven't allocated/bound a name for the global buffer, a kernel using dcl_indexed_temp_array simply fails to run.

I was just hoping that it's possible to use arrays in CAL IL, but as far as I understand there are no real arrays in IL, only emulation via the global buffer.

Still weird that commenting the line out makes no difference to the compiler.

0 Likes

to Micah:

As far as I understand, an indexed temp array works like a local array, although there is no dedicated space for it on the hardware. It can be placed in the register file if it is small, or otherwise in the scratch buffer (where is that on the hw?).

So can we look forward to support for indexed temp arrays in Brook+?

to empty_knapsack:

If the array is small enough to fit in the GPRs, it will not use the scratch buffer (I don't know whether scratch buffer = global buffer).

If you do not declare the array, the compiler uses a default size of 4096, which is too large to be placed in the register file. And as I described in the top post, arrays that use the scratch buffer do not interfere with the global buffer. The bug is only seen if the array is placed in the register file, which, I guess, is the case when you insert "dcl". So I guess we are facing the same problem. 🙂

However, for me, commenting with ";" works (CAL 1.3/1.4 and SKA). Please check whether you forgot the "\n", which must be inserted after every IL line.
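In case the kernel lives in the C source as a string literal, this is what I mean. It is a small sketch; `kernel_src` and the helper `count_il_lines` are just names made up for the example. Adjacent C string literals are glued together with nothing in between, so without the "\n" two IL statements would end up on one line.

```c
#include <stddef.h>

/* The test kernel from the top post, with one "\n" per IL line. */
static const char kernel_src[] =
    "il_ps_2_0\n"
    "dcl_indexed_temp_array x0[2]\n"
    "dcl_input vObjIndex0\n"
    "mov x0[vObjIndex0.x], 1\n"
    "mov r0, x0[0]\n"
    "mov g[0], r0\n"
    "endmain\n";

/* Hypothetical helper: counts newline-terminated lines in an IL string. */
size_t count_il_lines(const char *src)
{
    size_t lines = 0;
    for (; *src != '\0'; ++src)
        if (*src == '\n')
            ++lines;
    return lines;
}
```

A quick sanity check is to compare count_il_lines(kernel_src) with the number of IL statements you wrote.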

0 Likes

the729,

 

no, it's not a missing '\n' problem: I'm loading the kernel from a separate text file, so there are no problems with newlines. I've made some tests and realized that any dcl_indexed_temp_array declaration leads to a variable named "x[]" being declared inside the compiled image. (Grr, English isn't my native language and I have some difficulty explaining what I really mean :S).

 

Example:

I have a kernel like:

il_ps_2_0
dcl_input_position_interp(linear_noperspective) vWinCoord0.xy__
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_cb cb0[4]
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
sample_resource(0)_sampler(0) r0, vWinCoord0.xyxx
sample_resource(1)_sampler(0) r1, vWinCoord0.xyxx
sample_resource(2)_sampler(0) r2, vWinCoord0.xyxx
sample_resource(3)_sampler(0) r3, vWinCoord0.xyxx
sample_resource(4)_sampler(0) r4, vWinCoord0.xyxx
dcl_literal l1,0x7fffffff,0x7fffffff,0x7fffffff,0x7fffffff
dcl_literal l2,0x80000000,0x80000000,0x80000000,0x80000000
dcl_literal l3,0x80000001,0x80000001,0x80000001,0x80000001

iadd r10,r0,l1
iadd r11,r1,l2
iadd r12,r2,l3

mov o0,r10
mov o1,r11
mov o2,r12

end

I'm reading it from a text file, compiling, linking and saving it as an ELF image with:

 if (calclCompile(&obj, lang, pText, info.target) != CAL_RESULT_OK) {
  fprintf(stdout, "Kernel compilation failed. Exiting.\n");
  return 1;
 }
 if (calclLink(&image, &obj, 1) != CAL_RESULT_OK) {
  fprintf(stdout, "Kernel linking failed. Exiting.\n");
  return 1;
 }
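 {
  CALuint isize = 0;  /* calclImageGetSize expects a CALuint* */
  if (calclImageGetSize(&isize, image) != CAL_RESULT_OK) {
   fprintf(stdout, "Could not query image size. Exiting.\n");
   return 1;
  }
  BYTE *px = (BYTE *)malloc(isize);
  if (px == NULL) {
   return 1;
  }
  calclImageWrite(px, isize, image);

  /* dump the linked ELF image so its symbol names can be inspected */
  FILE *f = fopen("image.bin", "wb");
  if (f != NULL) {
   fwrite(px, 1, isize, f);
   fclose(f);
  }
  free(px);
 }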

 Now, looking at the newly created image.bin, I can see at the very end the declaration of all inputs/outputs: "i4 i3 i2 i1 i0 s0 o2 o1 o0 cb0". If I add just one line to the kernel, "dcl_indexed_temp_array x0[2]", the inputs/outputs string in image.bin becomes "i4 i3 i2 i1 i0 s0 o2 o1 o0 x[] cb0".

And any use of "dcl_indexed_temp_array" leads to the declaration of "x[]". Even if it's commented out, "x[]" still appears. Even if there are no references to the declared array, "x[]" is still there.

And (as all inputs/outputs need to be bound before calling calCtxRunProgram) my program fails to run, because there is no global buffer and so nothing is allocated/bound.
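To check this without opening image.bin in a hex viewer, something like the following can be used. It is a rough sketch: `image_has_symbol` is a made-up helper, it does a plain byte search and does not parse the ELF structure, so it can also match the same bytes somewhere else in the file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the bytes of `sym` occur anywhere in
 * the image buffer, 0 otherwise. Plain byte search, no ELF parsing. */
int image_has_symbol(const unsigned char *image, size_t size, const char *sym)
{
    size_t len = strlen(sym);
    if (len == 0 || len > size)
        return 0;
    for (size_t i = 0; i + len <= size; ++i)
        if (memcmp(image + i, sym, len) == 0)
            return 1;
    return 0;
}
```

Calling it on the buffer filled by calclImageWrite, with "x[]" as the name to look for, tells whether the extra symbol is present.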

Adding something like:

 CALresource localRes;

 if (calResAllocRemote2D(&localRes, &device, 1, DIM_X, DIM_Y, CAL_FORMAT_UINT_4, CAL_RESALLOC_GLOBAL_BUFFER) != CAL_RESULT_OK) {
  printf("Error [%s]\n", calGetErrorString());
 }
 CALmem localMem;
 
 calCtxGetMem(&localMem, ctx, localRes);
 CALname localName;
 if (calModuleGetName(&localName, ctx, module, "x[]") != CAL_RESULT_OK) {
  printf("Error in getname [%s]\n", calGetErrorString());
 }
 calCtxSetMem(ctx, localName, localMem);

This solves the problem with running, as "x[]" is now allocated/bound. But in fact it isn't used at all inside the kernel.

So I really doubt it's possible to declare any local array inside an IL kernel, even a very small one.

 

// Hope you got what I'm trying to explain here

0 Likes

Yeah, I just repeated your test and got the same result as you.

Moreover, changing all x0[] in your IL kernel into x1[] leads to a completely different result: I got x[] in image.bin when "dcl" is present and not commented out, and a segmentation fault if "dcl" is commented out or not present.

0 Likes

empty_knapsack/the729,

 The indexed arrays, or scratch buffers as they are called in hardware, are stored in main memory but are not emulated in the global buffer. If the compiler can determine that your indexed array accesses fit in registers, it compiles them to registers, either through static addressing via register copies or through dynamic addressing using the ar register. The scratch buffer is mainly used for register spilling as required by the DX spec, but it was exposed in CAL as a method for thread-local storage.
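As a rough illustration of the two register paths (not the actual compiler logic), here is a toy C model. The `reg_file` struct, the function names, the scalar registers, and placing the array at R4 (taken from the first disassembly in this thread) are all assumptions made for the sketch.

```c
#include <stddef.h>

#define ARRAY_BASE 4   /* assumption: array placed at R4, as in the first dump */
#define ARRAY_LEN  2   /* matches dcl_indexed_temp_array x0[2] */

/* Toy register file: scalar registers plus one address register (A0.x). */
typedef struct {
    float  regs[ARRAY_BASE + ARRAY_LEN];
    size_t a0;
} reg_file;

/* Static addressing: the index is known at compile time, so x0[0] is
 * just a copy from a fixed register. */
float read_static_0(const reg_file *rf)
{
    return rf->regs[ARRAY_BASE + 0];
}

/* Dynamic addressing: load the address register, then read relative to
 * the array base. The caller must keep index < ARRAY_LEN. */
float read_dynamic(reg_file *rf, size_t index)
{
    rf->a0 = index;                        /* MOVA_INT ____, index */
    return rf->regs[ARRAY_BASE + rf->a0];  /* MOV dst, R4[A0.x]    */
}
```

Both paths read the same storage, so for index 0 they have to agree; only the scratch buffer path leaves the register file.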

0 Likes

This seems like a valid bug. Do either of you have a simple test case that you can email to streamdeveloper@amd.com, attn: Micah Villmow, so that I can work on getting it fixed?

0 Likes

I've just sent an email to streamdeveloper@amd.com describing this and another calcl bug, though without any mention of your name.

But I'm a bit puzzled that you always require email when all the necessary information is already present here on the forum. Does it mean that only emails matter while the forum is just "for fun"? Some weird bureaucracy.

0 Likes

Thanks for the test case, looking at the issue.

No, the email was for the test case, as that gives me the exact code base in which you see the problem. In most cases it is not feasible to put the full test case on the forums, and differences in how we each write the test case might cause divergent results. This just removes variables from the test.

0 Likes

Ok,

 So this seems to be fixed in either 9.4 or 9.5; I'm not sure when my internal compiler version will be made public.

 

il1.il:

il_tester.exe -f il1.il -a
Program:                  il_tester.exe         Kernel  System
 WxH            In-Out   Src     Dst     Iter   GB/sec  GB/sec
File: il1.il - ShaderType = 1
TargetChip = w
;SC Dep components
NumClauseTemps = 4

; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(5)
      0  x: MOV         R1.x,  0.0f
         y: MOV         R1.y,  0.0f
         z: MOV         R1.z,  0.0f
         w: MOV         R1.w,  0.0f
         t: MOV         R2.x,  R0.y
01 ALU: ADDR(37) CNT(5)
      1  x: MOVA_INT    ____,  R0.x
      2  x: MOV         R5[A0.x].x,  R1.x
         y: MOV         R5[A0.x].y,  R1.y
         z: MOV         R5[A0.x].z,  R1.z
         w: MOV         R5[A0.x].w,  R1.w
02 ALU: ADDR(42) CNT(5)
      3  x: MOVA_INT    ____,  R2.x
      4  x: MOV         R2.x,  R5[A0.x].x
         y: MOV         R2.y,  R5[A0.x].y
         z: MOV         R2.z,  R5[A0.x].z
         w: MOV         R2.w,  R5[A0.x].w
03 MEM_EXPORT_WRITE: DWORD_PTR[0], R2, ELEM_SIZE(3)  VPM
04 ALU: ADDR(47) CNT(4)
      5  x: MOV         R2.x,  0.0f
         y: MOV         R2.y,  0.0f
         z: MOV         R2.z,  0.0f
         w: MOV         R2.w,  0.0f
05 EXP_DONE: PIX0, R2
END_OF_PROGRAM

 

il2.il:

il_tester.exe -f il2.il -a
Program:                  il_tester.exe         Kernel  System
 WxH            In-Out   Src     Dst     Iter   GB/sec  GB/sec
File: il2.il - ShaderType = 1
TargetChip = w
;SC Dep components
NumClauseTemps = 4

; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(5)
      0  x: MOV         R1.x,  0.0f
         y: MOV         R1.y,  0.0f
         z: MOV         R1.z,  0.0f
         w: MOV         R1.w,  0.0f
         t: MOV         R2.x,  R0.y
01 MEM_SCRATCH_WRITE_IND_ACK: VEC_PTR[0+R0.x], R1, ARRAY_SIZE(1) ELEM_SIZE(3)
02 WAIT_ACK:  Outstanding_acks <= 0
03 TEX: ADDR(48) CNT(1) VALID_PIX
      1  RD_SCRATCH R2, VEC_PTR[0+R2.x], ARRAY_SIZE(1) ELEM_SIZE(3) UNCACHED
04 MEM_EXPORT_WRITE: DWORD_PTR[0], R2, ELEM_SIZE(3)  VPM
05 ALU: ADDR(37) CNT(4)
      2  x: MOV         R2.x,  0.0f
         y: MOV         R2.y,  0.0f
         z: MOV         R2.z,  0.0f
         w: MOV         R2.w,  0.0f
06 EXP_DONE: PIX0, R2
END_OF_PROGRAM

 

Both look correct, but it seems that the scratch registers are not getting optimized away in the second example, which I've reported to the compiler team.





0 Likes