OpenCL

sp314
Adept II

Switching from ROCm to CLRadeonExtender, can't quite understand the kernel setup part.

I wrote some GCN assembly code using ROCm. Everything works, but now I'm told it must work with the AMDGPU-PRO drivers on Linux without ROCm and with the Catalyst drivers on Windows. So I figured I'll switch to CLRadeonExtender (once again, I must thank the author for creating it), unless there's some way to use ROCm to compile GCN asm code offline for the regular drivers and for Windows. I'm pretty sure the code itself will work, but I can't figure out the kernel setup and initialization part.

For instance, I have a kernel that uses:

- 64 bit pointers;

- 64 VGPRs;

- 32 SGPRs (including VCC);

- 4K of LDS;

- 4K of GDS;

- 2 64-bit pointers as the arguments.

Below is the header/beginning of the file that I use with ROCm/HSA for this kernel, including loading of the parameters. Can anybody point me to the equivalent setup that I need to do with CLRadeonExtender?

Thank you in advance,

sp

.hsa_code_object_version 2,0
.hsa_code_object_isa 7, 0, 1, "AMD", "AMDGPU"

.text
.p2align 8
.amdgpu_hsa_kernel test_kernel

test_kernel:
        .amd_kernel_code_t
                enable_sgpr_kernarg_segment_ptr = 1
                is_ptr64 = 1
                // compute_pgm_rsrc1_vgprs = (workitem_vgpr_count-1)/4
                compute_pgm_rsrc1_vgprs = 15
                // compute_pgm_rsrc1_sgprs = (wavefront_sgpr_count-1)/8
                compute_pgm_rsrc1_sgprs = 3
                // num of registers holding input params;
                // s0:s1 will point to a table of arguments
                compute_pgm_rsrc2_user_sgpr = 2
                // the next sgpr after the input params will hold the thread group id
                compute_pgm_rsrc2_tgid_x_en = 1
                // the params are 2 x 64-bit pointers
                kernarg_segment_byte_size = 16
                // 32 sgprs (including VCC, etc.)
                wavefront_sgpr_count = 32
                // 64 vgprs
                workitem_vgpr_count = 64
                // 4K of LDS
                workgroup_group_segment_byte_size = 4096
                // 4K of GDS
                gds_segment_byte_size = 4096
        .end_amd_kernel_code_t

        // actual code
        s_mov_b32 m0, 0x1000                    // 4k limit for the LDS
        s_mov_b32 s4, s2                        // copy group id to s4, since s0-s3 will be overwritten
        s_load_dwordx4 s[0:3], s[0:1], 0x0      // s0:1 = first data pointer, s2:3 = second data pointer
        s_waitcnt lgkmcnt(0)
        // code
        s_endpgm

matszpk
Adept III

Thank you for using my program.

In your case, sample code that sets up the kernel is:

.amdcl2   # use AMD OpenCL 2.0 binary format (current Windows drivers use it)
.64bit    # use 64-bit addressing
# the GPU type is set from the command line
.kernel test_kernel   # kernel definition
    .hsaconfig  # open kernel HSA configuration
        .dims x         # use single dimension X in code
                        # your compute_pgm_rsrc2_tgid_x_en = 1
        .use_kernarg_segment_ptr   # your enable_sgpr_kernarg_segment_ptr = 1
        .localsize 4096            # workgroup_group_segment_byte_size = 4096 (4k LDS)
        #.workgroup_group_segment_size  # alternative pseudo-op for setting the local size
        .gds_segment_size 4096     # gds_segment_byte_size = 4096 (4k GDS)
        .kernarg_segment_size 16   # kernarg segment size in bytes (16 bytes)
        .sgprsnum 32            # set number of SGPRs
        .vgprsnum 64            # set number of VGPRs
        # .userdatanum 2        # obsolete, but you can specify the number of user SGPRs
        # the pgmrsrc setup will be set automatically by the assembler
        # your kernel arguments
        .arg .....
    .text
        # your code

This is a simple configuration for Windows AMDCL2 using the HSA configuration. In the original AMDCL2 kernel configuration, the minimum number of user SGPRs is 4 (6 with the user kernel argument pointer, depending on configuration). The setup in the original configuration is very similar. I used the HSA config to reflect your kernel configuration exactly.

The AMD OpenCL 2.0 and ROCm kernel calling conventions pass kernel arguments differently, so you should convert the kernel argument offsets in your code to the correct values for AMD OpenCL 2.0.
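To illustrate what converting the argument offsets involves, here is a minimal Python sketch that lays out kernel arguments after an optional leading area. The function name `arg_offsets`, the natural-alignment rule, and above all the `base=48` value are assumptions made for illustration only; the real size of any leading setup area must be taken from the AMD OpenCL 2.0 ABI documentation, not from this sketch.

```python
def arg_offsets(arg_sizes, base=0):
    """Return (offsets, total_size) for arguments laid out after `base` bytes.

    Assumes each argument is aligned to its own size (natural alignment),
    which is an assumption of this sketch, not a statement about any driver.
    """
    offsets = []
    cursor = base
    for size in arg_sizes:
        # Round the cursor up to the next multiple of the argument size.
        cursor = (cursor + size - 1) // size * size
        offsets.append(cursor)
        cursor += size
    return offsets, cursor


# Layout used by the ROCm code in this thread: two 64-bit pointers at offset 0.
rocm_offsets, rocm_size = arg_offsets([8, 8])

# Same arguments shifted by a HYPOTHETICAL 48-byte leading area.
shifted_offsets, shifted_size = arg_offsets([8, 8], base=48)
```

With `base=0` the total size is 16 bytes, which matches the `kernarg_segment_byte_size = 16` in the original header. Whatever the real base turns out to be, the offsets passed to the `s_load_dword*` instructions have to move by the same amount.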

The ROCm code may not work under Windows drivers because of the different kernel argument passing and other calling convention differences. The GDS space allocation may not be working under all systems (for example some Linux AMDGPU-PRO installations).
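As a cross-check on the register counts in this thread, the two formulas from the comments in the original HSA header can be evaluated directly. This is only a sketch of that arithmetic: the helper name `granulated_count` is made up here, and the granule sizes (4 for VGPRs, 8 for SGPRs) are taken from those comments rather than from any hardware reference.

```python
def granulated_count(register_count, granule):
    """Encode a register count as (count - 1) // granule, as in the header comments."""
    if register_count < 1:
        raise ValueError("register_count must be at least 1")
    return (register_count - 1) // granule


# Granule sizes as stated in the comments of the original HSA header.
VGPR_GRANULE = 4
SGPR_GRANULE = 8

vgprs_field = granulated_count(64, VGPR_GRANULE)  # workitem_vgpr_count = 64
sgprs_field = granulated_count(32, SGPR_GRANULE)  # wavefront_sgpr_count = 32
print(vgprs_field, sgprs_field)  # prints "15 3"
```

These are the same values that were written by hand as `compute_pgm_rsrc1_vgprs = 15` and `compute_pgm_rsrc1_sgprs = 3`. With `.sgprsnum 32` and `.vgprsnum 64` in the CLRX configuration, the assembler sets the pgmrsrc values itself, so they no longer need to be computed manually.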


Great, thank you for your help. CLRX is now working for me, at least on a simple test. One more question though - you've mentioned that OpenCL 2.0 uses a different calling convention. Is this calling convention documented anywhere, and could you point me to the doc?

Thanks again,
sp


Yes. The calling conventions are described in the CLRX documentation (http://clrx.nativeboinc.org/wiki/wiki/ClrxToc), or in the package (folder 'share/doc/CLRX/clrx' in the package or installation), and in third-party documentation: 'AMD Catalyst ABI', 'AMD Catalyst OpenCL 2.0 ABI', and the ROCm ABI in 'User Guide for AMDGPU Backend' (LLVM 7 documentation).

This actually helps a lot, and I think everything works for me now, GDS included, you rock and thank you!

One more question, if I may. You said that 'The GDS space allocation may not be working under all systems (for example some Linux AMDGPU-PRO installations).' When and under which circumstances can it go wrong, and is there anything in particular that I should test for? Your assembler is working for me just fine, at least on Ubuntu 16.04 with AMDGPU-PRO 16, and thank you for making it. But what are the potential setups, configurations, and circumstances under which it could break?


I am sorry for my stupid mistake (it should read 'may not be working under some systems or some GPU devices'). However, GDS space is not used in OpenCL, and somebody reported problems under the Linux drivers when attempting to use GDS space. See 'Could you add more information about Global Data Share (GDS)? · Issue #12 · CLRX/CLRX-mirror · GitHu...' if you want to learn more about this problem. Maybe this problem doesn't exist under Windows. I have not tested this feature under many configurations and systems, and I have not used GDS space in my own projects/programs.
