cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

relpats_eht
Journeyman III

Raw ISA bytecode location

Hello,

I've been doing some work with the output of the OpenCL compiler for a project and am currently looking for the raw ISA bytecode output within the resulting binary. I've deduced the output of the OpenCL compiler looks something like this (with most sections omitted for brevity):

OpenCL Elf

  • .amdil: Text IL
  • .rodata: text runtime information
  • .text: AMD runtime elf
    • .text: IL bytecode
    • .text: Data containing ISA bytecode

The trouble is, I don't trust that second .text section in the AMD runtime elf as being strictly ISA bytecode. It generally looks like three different data sections, separated by chunks of zeros (in the sample programs I am compiling). I was wondering if someone could give me a bit of insight into the structure of this section. The r600 ISA spec is public, so I could try to decompile it by hand to see where it starts making sense, but I'd like to avoid the tedium if at all possible.

Thank you.

0 Likes
1 Solution
46 Replies

There is one more problem I just discovered but have not investigated:

I created an OpenCL kernel source file which contains two kernel functions (KFs). The first KF uses small amount of resources (registers, LDS, etc.) and the second KF uses a lot more. I ran KSA2 and exported the binary. The device configuration in the resulting ELF only shows the resources used by the first KF.

The question is if I decide to run the second KF, where does the driver find the resource (device config.) needed by the 2nd KF?

When you have 2 kernels, the structure of the inner and outer elf is the same as for 1 kernel and data for the 2 kernels is combined within the outer elf sections. The outer elf has 4 sections of interest. text, rodata, symtab, and strtab. The data is combined like this.

text = text1 + text2

rodata = rodata1 + rodata2

When you call the elf re-packer, it only sees text1 and rodata1 which come first in the sections.

To see the second kernel, you need to look at the symtab of the outer elf which has symbols for both kernels.

Something like:

       name           sec. index    "value"

__OpenCL_KF1_metadata      4         0

__OpenCL_KF1_kernel        5         0

__OpenCL_KF1_header        4         356

__OpenCL_KF2_metadata      4         388

__OpenCL_KF3_kernel        5         14912

__OpenCL_KF4_header        4         744

The elf symbol structure member "value" contains the offset within the sections (4=rodata, 5=text).

Use these offsets to determine what part of the sections you give to the inner elf processor.

Since you only run one kernel at a time, you only change and repack that kernel, so size calculations

can be done the same way as before.

drallan


Message was edited by: Robert Allan Barker

0 Likes

Excellent! I cannot find this in the AMD docs I have. Could you point me where this is documented?

jg1

0 Likes

jg1 wrote:

Excellent! I cannot find this in the AMD docs I have. Could you point me where this is documented?

jg1

I don't see this in any documents either but it's easy to see. Duplicate a kernel in a file and rename it so you have 2 identical kernels and compile. The elf dump section sizes are now exactly double but nothing else changes .

Only the symtab now refers to both kernels with offsets for kernel 2 into the second half of each section.

0 Likes

Re: the device configuration table and data class enum:

in asm.h, look for this:

/* Encoding dictionary entry */

struct si_bin_enc_dict_entry_t

{

          unsigned int userElementCount;

          struct si_bin_enc_user_element_t userElements[16];

};

I think the name "si_binenc_dict_entry_t" is incorrect............[more]............

That all looks good, the dots are connecting fast. Most things in multi2sim now fit together.

The other part is the buffers (referred to by user elements) used by the program. Most basic programs use 3,

1. UAV_TABLE or UAV_TABLE_PTR - has the 128 bit buffer descriptor(s) used by buffer_read/write insns.

2. IMM_CONST_BUFFER used for the NDrange information and other program constants

3. IMM_CONST_BUFFER containing the kernel arguments

The UAV_TABLE entries must be described by the struct si_buffer_resource_t.

What's interesting is the reference (in m2sim) to "Table 8.5 of the SI documentation" that describes the buffer resource. That's been in multi2sim for a long time, so there must be one more document somewhere.

0 Likes
realhet
Miniboss

So I checked hogy kernel function parameters are passed, and that turned out that it's quiet overcomplicated, too hard to  generate it by hand 

A (__global uint *a) -> (__global uint *a, __global uint *b) changes almost everything, except the 4KB zero sections. Here's a list:

;  userElements[0]    = IMM_UAV, 10, s[4:7]   //a

;  userElements[1]    = IMM_CONST_BUFFER, 0, s[8:11]  //a offset

;  userElements[2]    = IMM_CONST_BUFFER, 1, s[12:15]  //kernel range parameters

became

;  userElements[0]    = PTR_UAV_TABLE, 0, s[2:3]      //points to resource constants a,b

;  userElements[1]    = IMM_CONST_BUFFER, 0, s[4:7]   //kernel range parameters

;  userElements[2]    = IMM_CONST_BUFFER, 1, s[8:11]   //offsets for a and b

There is a limit that it can pass only 3 userelements only, and if it runs out then PTR_UAV_TABLE and/or PTR_CONST_TABLE comes in. I don't understand why it can't use more userElements, and why pack it into arrays because under cal you can use a lot more uavs/cb-es than 3.

COMPUTE_PGM_RSRC2:USER_SGPR      = 16

became

COMPUTE_PGM_RSRC2:USER_SGPR      = 12

Now I know were s16=WorkGroupID comes from.

Different uav/cb configs changing the symbol tables in the cal.elf

ocl.elf, .rodata: there is 2 new lines in the text part:

;pointer:a:u32:1:1:0:uav:10:4:RW:0:0

;pointer:b:u32:1:1:16:uav:11:4:RO:0:0  <- new line, containing variable name and read/write info +more, mapping to IL (uav10, uav11)

ocl.elf .rodata text part, exact type identifiers:

;reflection:0:uint*

;reflection:1:uint*

(.rodata binary part left unchanged)

Here are some example function headers and how to access each parameters:  [ <- denotes an indirect read (dword  address)

__kernel void Test(__global uint *a){

  a[get_global_id(0)]=5;}

COMPUTE_PGM_RSRC2:USER_SGPR      = 16

;  userElements[0]    = IMM_UAV, 10, s[4:7]              //a uav

;  userElements[1]    = IMM_CONST_BUFFER, 0, s[8:11]     //range params

;  userElements[2]    = IMM_CONST_BUFFER, 1, s[12:15]    //a offset

a[0]: resource(s[4:7]) offset s[12:15][0]

__kernel void Test(__global uint *a,  __global uint *b)

  a[get_global_id(0)] = b[get_global_id(0)];

;  userElements[0]    = PTR_UAV_TABLE, 0, s[2:3]         //points to a,b uav

;  userElements[1]    = IMM_CONST_BUFFER, 0, s[4:7]      //range params

;  userElements[2]    = IMM_CONST_BUFFER, 1, s[8:11]     //a,b offsets

a[0] at resource(s[2:3][0x50:0x53]) offset s[8:11][0]

b[0] at resource(s[2:3][0x58:0x5b]) offset s[8:11][4]

__kernel void Test(__global uint *a,  __global uint *b, __global uint *c){

  a[get_global_id(0)] = b[3]+c[5]; }

userElementCount     = 3;

;  userElements[0]    = PTR_UAV_TABLE, 0, s[2:3]         //points to a,b uav

;  userElements[1]    = IMM_CONST_BUFFER, 0, s[4:7]      //range params

;  userElements[2]    = IMM_CONST_BUFFER, 1, s[8:11]     //a,b,c offsets

a[0] at resource(s[2:3][0x50:0x53]) offset s[8:11][0]

b[0] at resource(s[2:3][0x58:0x5b]) offset s[8:11][4]

c[0] at resource(s[2:3][0x60:0x63]) offset s[8:11][8]

I gotta find out __constants later...

0 Likes

realhet wrote:

COMPUTE_PGM_RSRC2:USER_SGPR      = 16

became

COMPUTE_PGM_RSRC2:USER_SGPR      = 12

Now I know were s16=WorkGroupID comes from.

Great work, I've been looking for that one for a long time.

0 Likes

Hello,

here I decoded some __kernel parameter combinations.

They are complicated but there is logic in them.

And this time I don't used get_global_id() so it uses one less constant buffer, this way a 2 parameter kernel need no additional resource_constant reads from ptr_[uav/const]_tables. I think a 0 based WorkGroupID (sn) and a WorkItemID (v0) is more than enough, so I choose not to understand that kernel parameter buffer.

;opencl kernel parameters

;legend: g=get_global_id   u=UAV   c=CB

; declarations: u: __global uint *x    c: __constant uint *x

; dw(a,b) read dword from res_const(a) + dword offset b

; dwx4() read 4x dwords

; ofs x  means byte offset x

type     GroupID     Params

u          12          a=s[4:7] ofs dw(s[8:11],0)

uu         16          a=s[4:7] ofs dw(s[12:15],0)

                b=s[8:11] ofs dw(s[12:15],4)

uuu         8      a=dwx4(s[2:3],0x50) ofs dw(s[4:7],0)

                b=dwx4(s[2:3],0x58) ofs dw(s[4:7],4)

                c=dwx4(s[2:3],0x60) ofs dw(s[4:7],8)

uuuu        8      a=dwx4(s[2:3],0x50) ofs dw(s[4:7],0)

                b=dwx4(s[2:3],0x58) ofs dw(s[4:7],4)

                c=dwx4(s[2:3],0x60) ofs dw(s[4:7],8)

                d=dwx4(s[2:3],0x68) ofs dw(s[4:7],0xC)

uc         16      a=s[4:7] ofs dw(s[8:11],0)

                b=s[12:15] ofs 0

uuc        12          a=dwx4(s[2:3],0x50) ofs dw(s[4:7],0)

                b=dwx4(s[2:3],0x58) ofs dw(s[4:7],4)

                c=s[8 :11] ofs 0

uuuc       12          a=dwx4(s[2:3],0x50) ofs dw(s[4:7],0)

                b=dwx4(s[2:3],0x58) ofs dw(s[4:7],4)

                c=dwx4(s[2:3],0x60) ofs dw(s[4:7],8)

                d=s[8:11] ofs 0

ucc         8          a=s[4:7] ofs dw(dw4x(s[2:3],4),0)

                b=dw4x(s[2:3],8) ofs 0

                c=dw4x(s[2:3],12) ofs 0

uucc       12          a=s[4:7] ofs dw(dw4x(s[2:3],4),0)

                b=s[8:11] ofs dw(dw4x(s[2:3],4),4)

                c=dw4x(s[2:3],8) ofs 0

                d=dw4x(s[2:3],12) ofs 0

uccc        8          a=s[4:7] ofs dw(dw4x(s[2:3],4),0)

                b=dw4x(s[2:3],8) ofs 0

                c=dw4x(s[2:3],12) ofs 0

                d=dw4x(s[2:3],16) ofs 0

Btw, why it uses an additional offset for every UAV? They could easily put that offset into the resource constant, why make the ocl program slower when it could be faster?

Even I can bet it's always zero and they was used in the past when systems had only 1 or 8 UAVs max. I'll figure that out soon.

Another test I wanna check: what if I write into a constant buffer? Does it send that back the data to the host, as in cal? If so, then we can have 3 buffer parameters without additional cost.

0 Likes