9 Replies Latest reply on Mar 22, 2011 10:52 AM by llongeri

    CAL compilation hanging

    llongeri
      CAL compiler hangs depending on the constant value in a UMUL instruction

      Hi,

      I have come up with this IL code that hangs the CAL compiler with 100% CPU usage (ati-stream-sdk-v2.3).

      The code is a simple computation and uses some constants.

      The compilation seems to go fine up to the last 2 umul instructions. Here a register is multiplied by the constant 0x1000000 (l[2].z). The thing is, if I change the constant to some other value, such as 0x1000001, the compiler works fine.

      il_cs_2_0
      dcl_num_thread_per_group 64
      dcl_cb cb0[2]
      dcl_literal l0, 5, 10, 0x100, 100
      dcl_literal l1, 4, 3, 2, 1
      dcl_literal l2, 0x10000, 1000, 0x1000001, 0
      mov r0.x, vaTid.x
      mov r0.y, vThreadGrpId.x
      mov r0.z, vTidInGrp.x
      umul r0.w, r0.x, l[0].x
      iadd r1, r0.wwww, l[1]
      iadd r2, cb0[0].xxxx, r1
      iadd r3.x, cb0[0].x, r0.w
      umod r4, r2, l[0].yyyy
      umod r3.y, r3.x, l[0].y
      udiv r5, r2, l[0].yyyy
      udiv r3.z, r3.x, l[0].y
      umod r6, r5, l[0].yyyy
      umod r3.w, r3.z, l[0].y
      umul r5, r6, l[0].zzzz
      umul r3.z, r3.w, l[0].z
      iadd r6, r4, r5
      iadd r3.w, r3.y, r3.z
      udiv r4, r2, l[0].wwww
      udiv r3.y, r3.x, l[0].w
      umod r5, r4, l[0].yyyy
      umod r3.z, r3.y, l[0].y
      umul r4, r5, l[2].xxxx
      umul r3.y, r3.z, l[2].x
      iadd r5, r6, r4
      iadd r3.z, r3.w, r3.y
      udiv r4, r2, l[2].yyyy
      udiv r3.y, r3.x, l[2].y
      umod r6, r4, l[0].yyyy
      umod r3.w, r3.y, l[0].y
      umul r4, r6, l[2].zzzz
      umul r3.y, r3.w, l[2].z
      iadd r6, r5, r4
      iadd r3.w, r3.z, r3.y
      mov g[r0.x], r6
      end

        • CAL compilation hanging
          MicahVillmow
          llongeri,
          I don't see this issue with our upcoming SDK release with our internal tools. Do you have a small test app that we can use to attempt to reproduce it? Also, can you post the output of CLInfo.exe here so we know what the system setup is?
            • CAL compilation hanging
              llongeri

              This is the CLInfo output:

              Number of platforms:                 1
                Platform Profile:                 FULL_PROFILE
                Platform Version:                 OpenCL 1.1 ATI-Stream-v2.3 (451)
                Platform Name:                 ATI Stream
                Platform Vendor:                 Advanced Micro Devices, Inc.
                Platform Extensions:                 cl_khr_icd cl_amd_event_callback cl_amd_offline_devices


                Platform Name:                 ATI Stream
              Number of devices:                 2
                Device Type:                     CL_DEVICE_TYPE_GPU
                Device ID:                     4098
                Max compute units:                 18
                Max work items dimensions:             3
                  Max work items[0]:                 256
                  Max work items[1]:                 256
                  Max work items[2]:                 256
                Max work group size:                 256
                Preferred vector width char:             16
                Preferred vector width short:             8
                Preferred vector width int:             4
                Preferred vector width long:             2
                Preferred vector width float:             4
                Preferred vector width double:         0
                Native vector width char:             0
                Native vector width short:             0
                Native vector width int:             0
                Native vector width long:             0
                Native vector width float:             0
                Native vector width double:             0
                Max clock frequency:                 0Mhz
                Address bits:                     32
                Max memory allocation:             134217728
                Image support:                 Yes
                Max number of images read arguments:         128
                Max number of images write arguments:         8
                Max image 2D width:                 8192
                Max image 2D height:                 8192
                Max image 3D width:                 2048
                Max image 3D height:                 2048
                Max image 3D depth:                 2048
                Max samplers within kernel:             16
                Max size of kernel argument:             1024
                Alignment (bits) of base address:         32768
                Minimum alignment (bytes) for any datatype:     128
                Single precision floating point capability
                  Denorms:                     No
                  Quiet NaNs:                     Yes
                  Round to nearest even:             Yes
                  Round to zero:                 Yes
                  Round to +ve and infinity:             Yes
                  IEEE754-2008 fused multiply-add:         Yes
                Cache type:                     None
                Cache line size:                 0
                Cache size:                     0
                Global memory size:                 536870912
                Constant buffer size:                 65536
                Max number of constant args:             8
                Local memory type:                 Scratchpad
                Local memory size:                 32768
                Kernel Preferred work group size multiple:     64
                Error correction support:             0
                Unified memory for Host and Device:         0
                Profiling timer resolution:             1
                Device endianess:                 Little
                Available:                     Yes
                Compiler available:                 Yes
                Execution capabilities:                 
                  Execute OpenCL kernels:             Yes
                  Execute native function:             No
                Queue properties:                 
                  Out-of-Order:                 No
                  Profiling :                     Yes
                Platform ID:                     0x7fca63c79880
                Name:                         Cypress
                Vendor:                     Advanced Micro Devices, Inc.
                Driver version:                 CAL 1.4.900
                Profile:                     FULL_PROFILE
                Version:                     OpenCL 1.1 ATI-Stream-v2.3 (451)
                Extensions:                     cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt


                Device Type:                     CL_DEVICE_TYPE_CPU
                Device ID:                     4098
                Max compute units:                 4
                Max work items dimensions:             3
                  Max work items[0]:                 1024
                  Max work items[1]:                 1024
                  Max work items[2]:                 1024
                Max work group size:                 1024
                Preferred vector width char:             16
                Preferred vector width short:             8
                Preferred vector width int:             4
                Preferred vector width long:             2
                Preferred vector width float:             4
                Preferred vector width double:         0
                Native vector width char:             16
                Native vector width short:             8
                Native vector width int:             4
                Native vector width long:             2
                Native vector width float:             4
                Native vector width double:             0
                Max clock frequency:                 2400Mhz
                Address bits:                     64
                Max memory allocation:             1073741824
                Image support:                 No
                Max size of kernel argument:             4096
                Alignment (bits) of base address:         1024
                Minimum alignment (bytes) for any datatype:     128
                Single precision floating point capability
                  Denorms:                     Yes
                  Quiet NaNs:                     Yes
                  Round to nearest even:             Yes
                  Round to zero:                 Yes
                  Round to +ve and infinity:             Yes
                  IEEE754-2008 fused multiply-add:         No
                Cache type:                     Read/Write
                Cache line size:                 64
                Cache size:                     32768
                Global memory size:                 3221225472
                Constant buffer size:                 65536
                Max number of constant args:             8
                Local memory type:                 Global
                Local memory size:                 32768
                Kernel Preferred work group size multiple:     1
                Error correction support:             0
                Unified memory for Host and Device:         1
                Profiling timer resolution:             1
                Device endianess:                 Little
                Available:                     Yes
                Compiler available:                 Yes
                Execution capabilities:                 
                  Execute OpenCL kernels:             Yes
                  Execute native function:             Yes
                Queue properties:                 
                  Out-of-Order:                 No
                  Profiling :                     Yes
                Platform ID:                     0x7fca63c79880
                Name:                         Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
                Vendor:                     GenuineIntel
                Driver version:                 2.0
                Profile:                     FULL_PROFILE
                Version:                     OpenCL 1.1 ATI-Stream-v2.3 (451)
                Extensions:                     cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_media_ops cl_amd_popcnt cl_amd_printf

                • CAL compilation hanging
                  llongeri

                  Here is a small test app in C++ to reproduce it:

                  It hangs at the calclCompile call.

                  #include <time.h>
                  #include <cal.h>
                  #include <calcl.h>
                  #include <cal_ext.h>
                  #include <stdio.h>
                  #include <iostream>
                  #include <fstream>
                  #include <string>
                  #include <cstring>
                  #include <sstream>

                  using namespace std;

                  template <class T>
                  inline std::string to_string (const T& t)
                  {
                  std::stringstream ss;
                  ss << t;
                  return ss.str();
                  }

                  inline const string operator+(const string & s, int i)
                  {
                      return s + to_string(i);
                  }

                  string devicenames[] = {"R600", "RV610", "RV630", "RV670", "R700", "RV770", "RV710", "RV730",
                                          "CYPRESS", "JUNIPER", "REDWOOD", "CEDAR", "RESERVED0", "RESERVED1",
                                          "WRESTLER", "CAYMAN", "RESERVED2", "BARTS"};

                  class MyException
                  {
                      public:
                      MyException(string msg)
                      {
                          message = msg;
                      }
                      const string GetMessage()
                      {
                          return message;
                      }

                      private:
                      string message;
                  };

                  void logDisassemble(const char* msg)
                  {
                      printf("%s\n", msg);
                  }

                  #define CHECK_CAL(A) checkCalCall(A, #A)

                  void checkCalCall(int result, string op)
                  {
                      printf("%s: %d\n", op.c_str(), result);
                      if(result != CAL_RESULT_OK)
                      {
                          throw MyException(op + ": " + result);
                      }
                  }

                  int go(int argc, char ** argv);

                  int main(int argc, char ** argv)
                  {
                         try
                      {
                          go(argc, argv);
                      }
                      catch(MyException x)
                      {
                          cout<<"Exception: "<<x.GetMessage()<<endl;
                          return 1;
                      }
                      cout<<"OK"<<endl;
                      return 0;
                  }

                  int go(int argc, char ** argv)
                  {

                      int inlen = 8;
                      CALuint constants[8] = {0, 0, 0, 0, 0, 0, 0, 0};

                      string kernelsrc = "";
                      char line[1024];
                      int count = 0;
                      while (!cin.eof())
                      {
                          cin.getline(line, 1024);
                          kernelsrc = kernelsrc + line + "\n";
                      }

                      cout<<kernelsrc;

                      CHECK_CAL(calInit());

                      CALuint version[3];
                      CHECK_CAL(calGetVersion(&version[0], &version[1], &version[2]));
                      printf("CAL Runtime version %d.%d.%d\n", version[0], version[1], version[2]);

                      CALuint clversion[3];
                      CHECK_CAL(calclGetVersion(&clversion[0], &clversion[1], &clversion[2]));
                      printf("CAL Compiler     version %d.%d.%d\n", clversion[0], clversion[1], clversion[2]);

                      CALuint numDevices = 0;
                      CHECK_CAL(calDeviceGetCount(&numDevices));

                      cout<<"device count: "<<numDevices<<endl;

                      if (numDevices < 1)
                          throw MyException("no devices");

                      CALdeviceinfo info;
                      CHECK_CAL(calDeviceGetInfo(&info, 0));

                      CALobject object = NULL;
                      CALimage image = NULL;
                      CHECK_CAL(calclCompile(&object, CAL_LANGUAGE_IL, kernelsrc.c_str(), info.target));
                      CHECK_CAL(calclLink(&image, &object, 1));

                      printf("///////////////////////////////////////////////////////////////////////\n");
                      calclDisassembleObject(&object, &logDisassemble);
                      printf("///////////////////////////////////////////////////////////////////////\n");

                      CALdevice device = 0;
                      CHECK_CAL(calDeviceOpen(&device, 0));  // this is the device number

                      CALcontext ctx;
                      CHECK_CAL(calCtxCreate(&ctx, device));

                      CALresource output1Res = 0;
                      CHECK_CAL(calResAllocLocal2D(&output1Res, device, 256, 1, CAL_FORMAT_UNSIGNED_INT32_4, CAL_RESALLOC_GLOBAL_BUFFER));

                      CALresource constRes = 0;
                      CHECK_CAL(calResAllocLocal1D(&constRes, device, inlen, CAL_FORMAT_FLOAT32_1, 0));

                      CALuint* constPtr = NULL;
                      CALuint constPitch = 0;
                      CALmem constMem = 0;
                      CHECK_CAL(calResMap((CALvoid**)&constPtr, &constPitch, constRes, 0));
                      for (int i = 0; i < inlen; i++) {
                          constPtr[i] = constants[i];  // copy each constant into the mapped buffer
                      }

                      CHECK_CAL(calResUnmap(constRes));
                      // Mapping output resource to CPU and initializing values
                      void* data1 = NULL;
                      // Getting memory handle from resources
                      CALmem output1Mem = 0;
                      CALuint pitch1 = 0;
                      CHECK_CAL(calResMap(&data1, &pitch1, output1Res, 0));
                      memset(data1, 0, 256 * sizeof(CALuint) * 4);
                      CHECK_CAL(calResUnmap(output1Res));
                      // Get memory handles for various resources
                      CHECK_CAL(calCtxGetMem(&constMem, ctx, constRes));
                      CHECK_CAL(calCtxGetMem(&output1Mem, ctx, output1Res));

                      // Creating module using compiled image
                      CALmodule module = 0;
                      CHECK_CAL(calModuleLoad(&module, ctx, image));
                      // Defining symbols in module
                      CALfunc func = 0;
                      CALname out1Name = 0;
                      CALname constName = 0;
                      // Defining entry point for the module
                      CHECK_CAL(calModuleGetEntry(&func, ctx, module, "main"));
                      CHECK_CAL(calModuleGetName(&out1Name, ctx, module, "g[]"));
                      CHECK_CAL(calModuleGetName(&constName, ctx, module, "cb0"));
                      // Setting input and output buffers
                      // used in the kernel
                      CHECK_CAL(calCtxSetMem(ctx, out1Name, output1Mem));
                      CHECK_CAL(calCtxSetMem(ctx, constName, constMem));
                      // Setting domain

                      // do kernel calc

                      //-----------------------------------------------------------------
                      // Executing kernel and waiting for kernel to terminate
                      //-----------------------------------------------------------------
                      // Event to check completion of the kernel
                      CALevent e = 0;
                      CALprogramGrid pg;
                      pg.func = func;
                      pg.gridBlock.width = 64;
                      pg.gridBlock.height = 1;
                      pg.gridBlock.depth = 1;
                      pg.gridSize.width = 4;
                      pg.gridSize.height = 1;
                      pg.gridSize.depth = 1;
                      pg.flags = 0;
                      CHECK_CAL(calCtxRunProgramGrid(&e, ctx, &pg));

                      // Checking whether the execution of the kernel is complete or not
                      while (calCtxIsEventDone(ctx, e) == CAL_RESULT_PENDING);
                      // Reading output from output resources
                      int *fdata;

                      cout<<"-- FEED BEGIN --"<<endl;

                      calResMap((CALvoid**)&fdata, &pitch1, output1Res, 0);
                      for (int i = 0; i < 1024; ++i)
                      {
                          printf("%u\n", fdata[i]);  // print each output element, not the pointer
                      }
                      cout<<"-- FEED END --"<<endl;

                      CHECK_CAL(calResUnmap(output1Res));

                      // end

                      // Unloading the module
                      CHECK_CAL(calModuleUnload(ctx, module));
                      // Freeing compiled kernel binary
                      CHECK_CAL(calclFreeImage(image));
                      CHECK_CAL(calclFreeObject(object));
                      // Releasing resource from context
                      CHECK_CAL(calCtxReleaseMem(ctx, output1Mem));

                      // Deallocating resources
                      CHECK_CAL(calResFree(output1Res));

                      CHECK_CAL(calCtxDestroy(ctx));

                      CHECK_CAL(calDeviceClose(device));

                      CHECK_CAL(calShutdown());

                      return 0;
                  }

                    • CAL compilation hanging
                      llongeri

                      Anyway, I can easily change the algorithm to compute the same desired value successfully, but the compiler shouldn't hang with this.

                        • CAL compilation hanging
                          llongeri

                          Hi, I just noticed that the IL code I pasted originally had the constant value 0x1000001, which compiles fine. It is with the value 0x1000000 that compilation hangs:

                           

                          il_cs_2_0
                          dcl_num_thread_per_group 64
                          dcl_cb cb0[2]
                          dcl_literal l0, 5, 10, 0x100, 100
                          dcl_literal l1, 4, 3, 2, 1
                          dcl_literal l2, 0x10000, 1000, 0x1000000, 0
                          mov r0.x, vaTid.x
                          mov r0.y, vThreadGrpId.x
                          mov r0.z, vTidInGrp.x
                          umul r0.w, r0.x, l[0].x
                          iadd r1, r0.wwww, l[1]
                          iadd r2, cb0[0].xxxx, r1
                          iadd r3.x, cb0[0].x, r0.w
                          umod r4, r2, l[0].yyyy
                          umod r3.y, r3.x, l[0].y
                          udiv r5, r2, l[0].yyyy
                          udiv r3.z, r3.x, l[0].y
                          umod r6, r5, l[0].yyyy
                          umod r3.w, r3.z, l[0].y
                          umul r5, r6, l[0].zzzz
                          umul r3.z, r3.w, l[0].z
                          iadd r6, r4, r5
                          iadd r3.w, r3.y, r3.z
                          udiv r4, r2, l[0].wwww
                          udiv r3.y, r3.x, l[0].w
                          umod r5, r4, l[0].yyyy
                          umod r3.z, r3.y, l[0].y
                          umul r4, r5, l[2].xxxx
                          umul r3.y, r3.z, l[2].x
                          iadd r5, r6, r4
                          iadd r3.z, r3.w, r3.y
                          udiv r4, r2, l[2].yyyy
                          udiv r3.y, r3.x, l[2].y
                          umod r6, r4, l[0].yyyy
                          umod r3.w, r3.y, l[0].y
                          umul r4, r6, l[2].zzzz
                          umul r3.y, r3.w, l[2].z
                          iadd r6, r5, r4
                          iadd r3.w, r3.z, r3.y
                          mov g[r0.x], r6
                          end
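
For reference, each lane of the kernel above appears to pack the four low decimal digits of a value into the four bytes of a 32-bit word. The following is a minimal C++ sketch of that per-lane arithmetic as read off the IL listing. The function name `pack_decimal_digits` is hypothetical, and the sketch assumes umod, udiv and umul act as plain 32-bit unsigned operations. In the kernel the input would be cb0[0].x + 5 * vaTid.x + (4, 3, 2, 1) for the x, y, z and w lanes.

```cpp
#include <cstdint>

// Hypothetical host-side reference for one lane of the IL kernel:
// digit k (k = 0..3) of v in base 10 is placed in byte k of the result.
constexpr std::uint32_t pack_decimal_digits(std::uint32_t v) {
    return (v % 10u)                          // umod by l[0].y
         + ((v / 10u) % 10u) * 0x100u         // umul by l[0].z
         + ((v / 100u) % 10u) * 0x10000u      // umul by l[2].x
         + ((v / 1000u) % 10u) * 0x1000000u;  // umul by l[2].z (the hanging one)
}

// Compile-time check of one worked value: digits 4, 3, 2, 1 land in bytes 0..3.
static_assert(pack_decimal_digits(1234u) == 0x01020304u,
              "digits should be packed one per byte");
```

A reference like this can be compared against the values read back from g[] once a variant of the kernel compiles.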

                            • CAL compilation hanging
                              Jawed

                              Hmm, very curious.

                              If I add a new literal:

                              dcl_literal l3, 16, 1000, 24, 0

                              and change:

                              umul r4, r6, l[2].zzzz

                              into:

                              ishl r4, r6, l3.zzzz

                              The compiler hangs too.

                              I noticed that all r3 computations are dead code.

                              Also I noticed that only the .w component of:

                              umul r4, r6, l[2].zzzz

                              is being computed. This is puzzling me at the moment. Overall, for whatever reason, I can come up with a variety of ways of hanging the IL compiler based on your code. None of them are your fault.

                              If, on the other hand, I try the attached code, compilation is fine. Note that this code "fixes" the final addition. This is a real mess. Again, not your fault.

                              All my comments are based on testing with SKA 1.7.

                              il_cs_2_0
                              dcl_num_thread_per_group 64
                              dcl_cb cb0[2]
                              dcl_literal l0, 5, 10, 0x100, 100
                              dcl_literal l1, 4, 3, 2, 1
                              dcl_literal l2, 0x10000, 1000, 0x1000000, 0
                              mov r0.x, vaTid.x
                              mov r0.y, vThreadGrpId.x
                              mov r0.z, vTidInGrp.x
                              umul r0.w, r0.x, l[0].x
                              iadd r1, r0.wwww, l[1]
                              iadd r2, cb0[0].xxxx, r1
                              umod r4, r2, l[0].yyyy
                              udiv r5, r2, l[0].yyyy
                              umod r6, r5, l[0].yyyy
                              umul r5, r6, l[0].zzzz
                              iadd r6, r4, r5
                              udiv r4, r2, l[0].wwww
                              umod r5, r4, l[0].yyyy
                              umul r4, r5, l[2].xxxx
                              iadd r5, r6, r4
                              udiv r4, r2, l[2].yyyy
                              umod r6, r4, l[0].yyyy
                              umul r4, r6, l[2].zzzz
                              //iadd r6, r5, r4
                              iadd r6.x, r5.x, r4.x
                              iadd r6.y, r5.y, r4.y
                              iadd r6.z, r5.z, r4.z
                              iadd r6.w, r5.w, r4.w/**/
                              mov g[r0.x], r6
                              end

                                • CAL compilation hanging
                                  llongeri

                                  Thanks Jawed,

                                  Well, the code that I attached is actually only the beginning of a larger program, which is why it has some dead code: those values were used by code I deleted. I wanted to attach something small that still hangs the compiler rather than 1000+ lines of code.

                                  And yes, doing a shift was my first alternative, but it also hangs.

                                  Thanks for your code. I am actually generating the IL from a high-level compiler, so I have to emit something that, once translated to IL, won't hang the CAL compiler. It's really frustrating; I have hung the CAL compiler with several programs.

                                  Since the troubling line multiplies a variable by 0x1000000, the simplest solution I could think of is to multiply by two constants that sum to 0x1000000 (such as 0xFFFFFF and 1; with 1 I can save a mul), and then add the parts.

                                  It seems that the CAL compiler is trying to do some optimization because the constant 0x1000000 is 1 << 24, and it hangs in the process. And since I am doing some udiv in between (which has no simple translation to the final ISA code by the CAL compiler), I guess that doesn't help.
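
The arithmetic behind this workaround can be checked on the host. The sketch below is only an illustration in C++, not IL, and the function names are hypothetical. It assumes 32-bit unsigned wraparound, which is how umul is being used here, and shows that multiplying by 0x1000000, shifting left by 24, and the split form x * 0xFFFFFF + x all give the same result.

```cpp
#include <cstdint>

// Direct multiply by 0x1000000 (the form that hangs the compiler).
constexpr std::uint32_t mul_direct(std::uint32_t x) { return x * 0x1000000u; }

// Equivalent shift, since 0x1000000 == 1 << 24.
constexpr std::uint32_t mul_shift(std::uint32_t x) { return x << 24; }

// Split form: 0xFFFFFF + 1 == 0x1000000, so x * 0xFFFFFF + x * 1 matches
// the direct product modulo 2^32. The multiply by 1 is just x itself.
constexpr std::uint32_t mul_split(std::uint32_t x) { return x * 0xFFFFFFu + x; }

// Spot checks, including values whose product overflows 32 bits.
static_assert(mul_direct(1u) == mul_split(1u), "x = 1");
static_assert(mul_direct(0x123u) == mul_split(0x123u), "overflowing x");
static_assert(mul_direct(0xFFFFFFFFu) == mul_split(0xFFFFFFFFu), "max x");
static_assert(mul_direct(0x123u) == mul_shift(0x123u), "shift matches");
```

In IL terms the split form costs one umul and one iadd, and neither constant is a power of two.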

                                    • CAL compilation hanging
                                      Jawed

                                      Are you using CAL++? If not, perhaps it's worth trying. I have never used it, but it might help you get around some gotchas.

                                      As for the IL compiler being stupid, the only thing I can suggest as a short-term work-around is to use a constant buffer entry for the constant rather than a literal. That way the IL compiler can't do any optimisation. But that depends on your high level tool.
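
If the IL is produced by a generator, the constant-buffer workaround could be emitted roughly as below. This is only a sketch: the helper name `emit_umul_from_cb` and the choice of slot cb0[1] are assumptions (the listing declares cb0[2] but only reads cb0[0]), the host would have to write 0x1000000 into that slot, and nothing in this thread confirms that it avoids the hang.

```cpp
#include <string>

// Hypothetical generator helper: emit a umul whose second operand is read
// from constant buffer cb0 instead of a dcl_literal, so the IL compiler
// cannot see the constant's value at compile time.
std::string emit_umul_from_cb(const std::string& dst,
                              const std::string& src,
                              unsigned cb_index) {
    // Broadcast the .x component of the chosen slot, e.g. "cb0[1].xxxx".
    const std::string operand = "cb0[" + std::to_string(cb_index) + "].xxxx";
    return "umul " + dst + ", " + src + ", " + operand + "\n";
}
```

For example, `emit_umul_from_cb("r4", "r6", 1)` yields the line `umul r4, r6, cb0[1].xxxx`, which would stand in for `umul r4, r6, l[2].zzzz`.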

                                      It could be that two or more literals are interacting and the IL compiler is oscillating amongst them in some bizarre evaluation of what is "best".

                                        • CAL compilation hanging
                                          llongeri

                                          I haven't tried CAL++. I am using my own compiler: I wanted more control over the code generation and some tweaks, and it is just fun. But I'll try CAL++ and other tools I have been finding around; they look interesting.

                                          And yes, I am pushing constants. Normally I pass a zero in the constant buffer and add it to itself into a register, and then add this to anything I don't want optimized. Anyway, the CAL compiler does shuffle things around; normally it does a good job.