Switching from ROCm to CLRadeonExtender, can't quite understand the kernel setup part.

Question asked by sp314 on Mar 27, 2018
Latest reply on Apr 11, 2018 by matszpk

I wrote some GCN assembly code using ROCm. Everything works, but I have now been told it must also run with the AMDGPU-PRO drivers on Linux (without ROCm) and with the Catalyst drivers on Windows. So I figured I would switch to CLRadeonExtender (thanks again to its author for creating it), unless there is some way to use ROCm to compile GCN assembly offline for the regular Linux drivers and for Windows. I'm fairly sure the code itself will work, but I can't figure out the kernel setup and initialization part.


For instance, I have a kernel that uses:

- 64-bit pointers;
- 64 VGPRs;
- 32 SGPRs (including VCC);
- 4K of LDS;
- 4K of GDS;
- two 64-bit pointers as its arguments.


Below is the header/beginning of the file that I use with ROCm/HSA for this kernel, including loading of the parameters. Can anybody point me to the equivalent setup that I need to do with CLRadeonExtender?


Thank you in advance,




.hsa_code_object_version 2,0
.hsa_code_object_isa 7, 0, 1, "AMD", "AMDGPU"

.p2align 8
.amdgpu_hsa_kernel test_kernel

test_kernel:
        .amd_kernel_code_t
                enable_sgpr_kernarg_segment_ptr = 1
                is_ptr64 = 1

                // compute_pgm_rsrc1_vgprs = (workitem_vgpr_count-1)/4, integer division
                compute_pgm_rsrc1_vgprs = 15
                // compute_pgm_rsrc1_sgprs = (wavefront_sgpr_count-1)/8, integer division
                compute_pgm_rsrc1_sgprs = 3

                // number of user SGPRs holding input params:
                // s0:s1 will point to a table of arguments
                compute_pgm_rsrc2_user_sgpr = 2
                // the next SGPR after the input params (s2) will hold the thread group id
                compute_pgm_rsrc2_tgid_x_en = 1

                // the params are 2 x 64-bit pointers
                kernarg_segment_byte_size = 16

                // 32 SGPRs (including VCC, etc.)
                wavefront_sgpr_count = 32
                // 64 VGPRs
                workitem_vgpr_count = 64

                // 4K of LDS
                workgroup_group_segment_byte_size = 4096

                // 4K of GDS
                gds_segment_byte_size = 4096
        .end_amd_kernel_code_t

        // actual code

        s_mov_b32 m0, 0x1000                    // 4K limit for the LDS
        s_mov_b32 s4, s2                        // copy group id to s4, since s0-s3 will be overwritten
        s_load_dwordx4 s[0:3], s[0:1], 0x0      // s0:1 = first data pointer, s2:3 = second data pointer
        s_waitcnt lgkmcnt(0)

        // code
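
As a sanity check on the two register-count fields in the header above, here is a minimal Python sketch of the arithmetic given in the comments. The formulas and the granule sizes (4 for VGPRs, 8 for SGPRs) are taken from those comments only; the function name `granule_field` is hypothetical and not part of any toolchain.

```python
def granule_field(register_count: int, granule: int) -> int:
    """Encode a register count as (register_count - 1) // granule.

    This mirrors the formulas in the header comments; the granule sizes
    passed in below are the ones stated there, not values queried from
    a driver or assembler.
    """
    if register_count < 1:
        raise ValueError("register_count must be at least 1")
    return (register_count - 1) // granule


# Values from the kernel described in the post.
vgprs_field = granule_field(64, 4)  # workitem_vgpr_count = 64
sgprs_field = granule_field(32, 8)  # wavefront_sgpr_count = 32

print("compute_pgm_rsrc1_vgprs =", vgprs_field)
print("compute_pgm_rsrc1_sgprs =", sgprs_field)
```

With 64 VGPRs and 32 SGPRs this gives 15 and 3, which matches the values written in the header.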