11 Replies Latest reply on Aug 11, 2017 3:11 AM by jpsollie

    clBuildProgram causes BRIG validation error

    jpsollie

      Hi Everyone,

      So, I'll first post my system:

      hardware:

      2x opteron 6276, 128GB ram, combined with R9 nano

      software:

      -linux 4.10.17 x64

      -LLVM 4.0.1 & 5.0.0 (git)

      -amd 17.30 opencl framework.

       

      problem:

      (I narrowed the config down to load only the amdgpu-pro ICD file.)

      when I create a program which compiles on the CPU, it works fine,

      when doing this on the GPU, it crashes with the following error:

      ---------------------------------------------------------------------

      Error in hsa_operand section, at offset 121368:

      Address is outside of memory allocated for variable

      LLVM ERROR:

      Brig container validation has failed in BRIGAsmPrinter.cpp

      ------------------------------------------------------------------------

      what I tried:

      - Mesa OpenCL (compiles, but does not show a correct result)

      - pocl & llvm 5.0.0 (works perfectly)

      - amdgpu-pro CPU driver (2348.3) (works perfectly)

      -amdgpu-pro GPU driver (2442.7) same error as 2348, but does not show a CPU ...

       

      *edit:

      also tried oclGrind and CodeXL, no problem there

       

      I suspected the error to be somewhere in LLVM, so I switched PATH and LD_LIBRARY_PATH to point to LLVM 4, but that did not change anything.

       

      Where does this error come from? and how do I fix it?

       

      thanks

        • Re: clBuildProgram causes BRIG validation error
          dipak

          Hi,

          Please provide the repro code for our investigation. Also, please share the clinfo output and OS information.

          I hope this is the driver where you observed the error: AMDGPU-PRO Driver for Linux Release Notes

           

          Regards,

            • Re: clBuildProgram causes BRIG validation error
              jpsollie

              no problem, here you go

              what do you want to know about my OS?

              I know Gentoo Linux is not supported, and neither is kernel 4.10.17, but I am not asking you to present me with a solution. Just maybe ... maybe ... you guys know more about BRIG/HSAIL compilation than I do.

              • Re: clBuildProgram causes BRIG validation error
                jpsollie

                Hi Dipak,

                I got the code compiled (though I do not know why it works), but I saw the following at runtime debugging:

                atom_inc(system) does not atomically increase the value of local uint system[0], whereas atom_xchg(system, system[0] + 1) does.  Do I need to open a new thread for this?

                 

                *edit:

                I also saw this behaviour on clover running with LLVM 5.0

                pocl 0.14 (which I use on the opteron CPUs) shows no difference, it runs on LLVM 4.0.1

                 

                Does this look like an LLVM error, or is it compiler related?

                 

                *edit2:

                this piece of code:

                            if(!output[14]) output[14] = system[0] + 1;
                            atom_inc(system);
                            if(!output[15]) output[15] = system[0];

                outputs in gdb:

                Breakpoint 1, worker (device_obj=0x609490) at ./engine.c:397

                397                 if(answer[3] == 255) {

                (gdb) print answer

                $1 = {0, 0, 0, 0, 255, 276, 340, 804850955, 40962, 0, 0, 0, 0, 0, 1, 64}

                (gdb) print answer[14]

                $2 = 1

                (gdb) print answer[15]

                $3 = 64

                 

                The fact that clover also has this issue suggests an LLVM error, no? Or am I mistaken?
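One possible reading of the gdb output, not confirmed anywhere in this thread: if 64 work-items of one work-group execute the three statements in lockstep, every work-item reads system[0] before any increment (giving 1), and every work-item reads it again only after all 64 increments (giving 64). Under that assumption the increment itself is atomic and the surprise comes from reading the shared counter separately instead of using the value the atomic call returns (atom_inc returns the old value). The plain C sketch below models this with C11 atomics; WORK_ITEMS, simulate_lockstep and ticket_after_increment are hypothetical names introduced only for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>

#define WORK_ITEMS 64  /* assumption: one work-group of 64 work-items running in lockstep */

/* Models the three statements from the snippet above when all work-items
   execute each statement before any of them moves on to the next one.
   out14 and out15 stand in for output[14] and output[15]. */
void simulate_lockstep(unsigned *out14, unsigned *out15)
{
    atomic_uint counter;            /* stands in for system[0] */
    atomic_init(&counter, 0);
    *out14 = 0;
    *out15 = 0;

    /* Statement 1: every work-item reads the counter before any increment. */
    for (size_t i = 0; i < WORK_ITEMS; ++i)
        if (!*out14) *out14 = atomic_load(&counter) + 1;

    /* Statement 2: every work-item performs its atomic increment. */
    for (size_t i = 0; i < WORK_ITEMS; ++i)
        atomic_fetch_add(&counter, 1);

    /* Statement 3: every work-item reads the counter after all increments. */
    for (size_t i = 0; i < WORK_ITEMS; ++i)
        if (!*out15) *out15 = atomic_load(&counter);
}

/* Returns this caller's own post-increment value by using the old value
   that the atomic operation hands back, instead of re-reading the counter. */
unsigned ticket_after_increment(atomic_uint *counter)
{
    return atomic_fetch_add(counter, 1) + 1;
}
```

Calling simulate_lockstep yields 1 and 64, the same pair gdb shows for answer[14] and answer[15]. In the kernel, the analogous change would be to store the return value of atom_inc(system) rather than reading system[0] afterwards.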

                  • Re: clBuildProgram causes BRIG validation error
                    dipak

                    First of all, thanks for sharing the repro code. After a quick test, it looks like a compiler optimization issue. The kernel seems to build fine if optimization is disabled, i.e. set to "-O0". I'll check further on another setup and report it to the compiler team if required. Meanwhile, could you please try the same and share your observations?

                     

                    Regarding the atomic query, I would suggest opening a new thread, as it seems to be an unrelated issue. It would also help us track these two issues separately. Please share the repro code and other setup details in that thread.

                     

                     

                    Regards,

                      • Re: clBuildProgram causes BRIG validation error
                        dipak

                        Update:

                        A ticket has been opened for this issue. Once I have any update on it, I'll share it with you.

                         

                        Regards,

                          • Re: clBuildProgram causes BRIG validation error
                            dipak

                            Hi,

                            Please find below the comments from the compiler team, which indicate that the error is in the source file itself. With the "finalcount" array size changed from 2 to 4, the kernel seems to build successfully.

                             

                            ------------------------------------------------------------------------

                            The error is in the program source. It defines the finalcount array with a size of 2 bytes and then reads 4 bytes from it:

                            void SHA1Final(private unsigned char digest[20], ctxarray* ctx, private uchar* ctxbuffer)
                            {
                                unsigned char finalcount[2];
                                ...
                                SHA1Update(finalcount, 2, ctx, ctxbuffer);  /* Should cause a SHA1Transform() */

                            ...

                            void SHA1Update(private const unsigned char* data, private const uchar len, ctxarray* ctx, private uchar* ctxbuffer)
                            {
                                uchar i, j;
                                j = (ctx->l1 >> 3) & 63;
                                atom_add(&(ctx->l1), len << 3);
                                if (((j + len) & 64) != 0) {
                                    os_memcpy(&ctxbuffer[j], data, (i = 64-j));

                            ...

                            void os_memcpy(private uchar* dest, private const uchar* src, const uchar amount) {
                                uchar j = 0, intamount = amount >> 2;
                                int* destination = (int*) dest;
                                const int* source = (const int*) src;
                                while(j < intamount) {
                                    destination[j] = source[j];

                             

                            The HSAIL specification explicitly contains a range check and we cannot omit it.

                            It also violates the C standard (ISO/IEC 9899) in two ways:

                            1. J.2 Undefined behavior

                            - Conversion between two pointer types produces a result that is incorrectly aligned (6.3.2.3).

                            2. J.2 Undefined behavior

                            - Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that does not point into, or just beyond, the same array object (6.5.6).

                            ------------------------------------------------------------------------
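To illustrate the two points raised by the compiler team, here is a minimal plain C sketch of a copy routine that avoids both problems: it copies byte by byte, so no uchar pointer is reinterpreted as an int pointer, and it clamps the length so it never reads past the source array. The names copy_bytes and copy_clamped and the explicit length parameters are hypothetical and are not part of the original kernel; the fix actually reported in this thread was enlarging finalcount from 2 to 4.

```c
#include <stddef.h>

/* Copies exactly `amount` bytes, one byte at a time. No cast to int*,
   so there is no alignment requirement on dest or src. */
void copy_bytes(unsigned char *dest, const unsigned char *src, size_t amount)
{
    for (size_t j = 0; j < amount; ++j)
        dest[j] = src[j];
}

/* Clamps the requested length to the real sizes of both buffers before
   copying, and returns the number of bytes actually copied. */
size_t copy_clamped(unsigned char *dest, size_t dest_len,
                    const unsigned char *src, size_t src_len,
                    size_t requested)
{
    size_t n = requested;
    if (n > src_len)  n = src_len;   /* never read beyond the source array */
    if (n > dest_len) n = dest_len;  /* never write beyond the destination */
    copy_bytes(dest, src, n);
    return n;
}
```

With a 2-byte source such as finalcount and a request for 4 bytes, copy_clamped copies only 2 bytes and reports that, instead of reading past the end of the array. Whether silently clamping is the right behaviour for SHA1Update is a separate design question; the sketch only shows how the out-of-range read can be avoided.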

                             

                            Regards,

                        • Re: clBuildProgram causes BRIG validation error
                          dipak

                          I can see the declaration as below:

                          local uint system[0]

                           

                          atom_add uses a 64-bit value and requires the cl_khr_int64_base_atomics extension to be enabled. Please try atomic_add instead for unsigned int.