5 Replies Latest reply on Apr 11, 2018 12:28 AM by matszpk

    Switching from ROCm to CLRadeonExtender, can't quite understand the kernel setup part.


      I wrote some GCN assembly code using ROCm. Everything works, but now I'm told it must work with the AMDGPU-PRO drivers on Linux without ROCm and with the Catalyst drivers on Windows. So I figured I'll switch to CLRadeonExtender (once again, I must thank the author for creating it), unless there is some way to use ROCm to compile GCN asm code offline for the regular drivers and for Windows. I'm pretty sure the code itself will work, but I can't figure out the kernel setup and initialization part.


      For instance, I have a kernel that uses

      - 64-bit pointers;

      - 64 VGPRs;

      - 32 SGPRs (including VCC);

      - 4K of LDS;

      - 4K of GDS;

      - 2 64-bit pointers as the arguments.


      Below is the header/beginning of the file that I use with ROCm/HSA for this kernel, including loading of the parameters. Can anybody point me to the equivalent setup that I need to do with CLRadeonExtender?


      Thank you in advance,




      .hsa_code_object_version 2,0

      .hsa_code_object_isa 7, 0, 1, "AMD", "AMDGPU"



      .p2align 8

      .amdgpu_hsa_kernel test_kernel





                      enable_sgpr_kernarg_segment_ptr = 1

                      is_ptr64 = 1


                      // compute_pgm_rsrc1_vgprs = (workitem_vgpr_count-1)/4

                      compute_pgm_rsrc1_vgprs = 15

                      // compute_pgm_rsrc1_sgprs = (wavefront_sgpr_count-1)/8

                      compute_pgm_rsrc1_sgprs = 3


                      // num of registers holding input params;

                      // s0:s1 will point to a table of arguments

                      compute_pgm_rsrc2_user_sgpr = 2

                      // the next sgpr after the input params will hold the thread group id

                      compute_pgm_rsrc2_tgid_x_en = 1


                      // the params are 2 x 64-bit pointers

                      kernarg_segment_byte_size = 16


                      // 32 sgprs (including VCC, etc.)

                      wavefront_sgpr_count = 32

                      // 64 vgprs

                      workitem_vgpr_count = 64


                      // 4K of LDS

                      workgroup_group_segment_byte_size = 4096


                      // 4K of GDS

                      gds_segment_byte_size = 4096




              // actual code


              s_mov_b32 m0, 0x1000                    // 4k limit for the LDS

              s_mov_b32 s4, s2                        // copy group id to s4, since s0-s3 will be overwritten

              s_load_dwordx4 s[0:3], s[0:1], 0x0      // s0:1 = first data pointer, s2:3 = second data pointer

              s_waitcnt lgkmcnt(0)   


              // code



        • Re: Switching from ROCm to CLRadeonExtender, can't quite understand the kernel setup part.

          Thank you for using my program.

          In your case, sample code that sets up the kernel is:


          .amdcl2   # use AMD OpenCL 2.0 binary format (current Windows drivers use it)

          .64bit    # use 64-bit addressing

          # the GPU device is set from the command line

          .kernel test_kernel   # kernel definition

              .hsaconfig  # open kernel HSA configuration

                  .dims x         # use single dimension X in code

                                  # your compute_pgm_rsrc2_tgid_x_en = 1

                  .use_kernarg_segment_ptr   # your enable_sgpr_kernarg_segment_ptr = 1

                  .localsize 4096            # workgroup_group_segment_byte_size = 4096 (4k LDS)

                  #.workgroup_group_segment_size  # or use this pseudo-op to set the local segment size

                  .gds_segment_size 4096     # gds_segment_byte_size = 4096 (4k GDS)

                  .kernarg_segment_size 16   # kernarg segment size in bytes (16 bytes)

                  .sgprsnum 32            # set number of SGPRs

                  .vgprsnum 64            # set number of VGPRs

                  # .userdatanum 2        # obsolete, but you can specify the number of user SGPRs

                  # the pgmrsrc registers are set up automatically by the assembler

                  # your kernel arguments

                  .arg .....


                  # your code


          This is a simple configuration for Windows AMDCL2 using the HSA configuration. In the original AMDCL2 kernel configuration, the minimum number of user SGPRs is 4 (6 with the user kernel argument pointer, depending on the configuration). The setup in the original configuration is very similar. I used the HSA config to reflect your kernel configuration exactly.

          The AMD OpenCL 2.0 and ROCm kernel calling conventions pass kernel arguments differently, so you should convert the kernel argument offsets in your code to the correct values for AMD OpenCL 2.0.

          The ROCm code may not work under the Windows drivers because of the different kernel argument passing and other differences in kernel calling conventions. GDS space allocation may not work on all systems (for example, some Linux AMDGPU-PRO installations).
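
For anyone checking the register fields by hand: the comments in the question give the formulas for the two granulated compute_pgm_rsrc1 values that the assembler fills in from .sgprsnum and .vgprsnum. A minimal Python sketch of that arithmetic follows. The function names are hypothetical, and integer (floor) division is an assumption, chosen because it reproduces the values 15 and 3 used in the question.

```python
def granulated_vgprs(workitem_vgpr_count: int) -> int:
    """compute_pgm_rsrc1_vgprs = (workitem_vgpr_count - 1) / 4, per the question's comment."""
    return (workitem_vgpr_count - 1) // 4  # floor division is an assumption


def granulated_sgprs(wavefront_sgpr_count: int) -> int:
    """compute_pgm_rsrc1_sgprs = (wavefront_sgpr_count - 1) / 8, per the question's comment."""
    return (wavefront_sgpr_count - 1) // 8  # floor division is an assumption


# Register counts from the kernel in the question: 64 VGPRs, 32 SGPRs (including VCC).
vgprs_field = granulated_vgprs(64)
sgprs_field = granulated_sgprs(32)

print("compute_pgm_rsrc1_vgprs =", vgprs_field)
print("compute_pgm_rsrc1_sgprs =", sgprs_field)
```

This only mirrors the hand-written ROCm header values. With the .hsaconfig block above, setting .vgprsnum 64 and .sgprsnum 32 is enough, since the assembler derives these fields itself.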