llongeri
Journeyman III

CAL compilation hanging

CAL compiler hangs depending on the constant value in a UMUL instruction

Hi,

I have come up with this IL code that hangs the CAL compiler with 100% CPU usage (ati-stream-sdk-v2.3).

The code is a simple computation and uses some constants.

The compilation seems to go fine up to the last two umul instructions, where a register is multiplied by the constant 0x1000000 (l[2].z). The odd thing is that if I change the constant to some other value, such as 0x1000001, the compiler works fine.

il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[2]
dcl_literal l0, 5, 10, 0x100, 100
dcl_literal l1, 4, 3, 2, 1
dcl_literal l2, 0x10000, 1000, 0x1000001, 0
mov r0.x, vaTid.x
mov r0.y, vThreadGrpId.x
mov r0.z, vTidInGrp.x
umul r0.w, r0.x, l[0].x
iadd r1, r0.wwww, l[1]
iadd r2, cb0[0].xxxx, r1
iadd r3.x, cb0[0].x, r0.w
umod r4, r2, l[0].yyyy
umod r3.y, r3.x, l[0].y
udiv r5, r2, l[0].yyyy
udiv r3.z, r3.x, l[0].y
umod r6, r5, l[0].yyyy
umod r3.w, r3.z, l[0].y
umul r5, r6, l[0].zzzz
umul r3.z, r3.w, l[0].z
iadd r6, r4, r5
iadd r3.w, r3.y, r3.z
udiv r4, r2, l[0].wwww
udiv r3.y, r3.x, l[0].w
umod r5, r4, l[0].yyyy
umod r3.z, r3.y, l[0].y
umul r4, r5, l[2].xxxx
umul r3.y, r3.z, l[2].x
iadd r5, r6, r4
iadd r3.z, r3.w, r3.y
udiv r4, r2, l[2].yyyy
udiv r3.y, r3.x, l[2].y
umod r6, r4, l[0].yyyy
umod r3.w, r3.y, l[0].y
umul r4, r6, l[2].zzzz
umul r3.y, r3.w, l[2].z
iadd r6, r5, r4
iadd r3.w, r3.z, r3.y
mov g[r0.x], r6
end

0 Likes
9 Replies

llongeri,
I don't see this issue with our upcoming SDK release with our internal tools. Do you have a small test app that we can use to attempt to reproduce it? Also, can you post the output of CLInfo.exe here so we know what the system setup is?
0 Likes

This is the CLInfo output:

Number of platforms:                 1
  Platform Profile:                 FULL_PROFILE
  Platform Version:                 OpenCL 1.1 ATI-Stream-v2.3 (451)
  Platform Name:                 ATI Stream
  Platform Vendor:                 Advanced Micro Devices, Inc.
  Platform Extensions:                 cl_khr_icd cl_amd_event_callback cl_amd_offline_devices


  Platform Name:                 ATI Stream
Number of devices:                 2
  Device Type:                     CL_DEVICE_TYPE_GPU
  Device ID:                     4098
  Max compute units:                 18
  Max work items dimensions:             3
    Max work items[0]:                 256
    Max work items[1]:                 256
    Max work items[2]:                 256
  Max work group size:                 256
  Preferred vector width char:             16
  Preferred vector width short:             8
  Preferred vector width int:             4
  Preferred vector width long:             2
  Preferred vector width float:             4
  Preferred vector width double:         0
  Native vector width char:             0
  Native vector width short:             0
  Native vector width int:             0
  Native vector width long:             0
  Native vector width float:             0
  Native vector width double:             0
  Max clock frequency:                 0Mhz
  Address bits:                     32
  Max memory allocation:             134217728
  Image support:                 Yes
  Max number of images read arguments:         128
  Max number of images write arguments:         8
  Max image 2D width:                 8192
  Max image 2D height:                 8192
  Max image 3D width:                 2048
  Max image 3D height:                 2048
  Max image 3D depth:                 2048
  Max samplers within kernel:             16
  Max size of kernel argument:             1024
  Alignment (bits) of base address:         32768
  Minimum alignment (bytes) for any datatype:     128
  Single precision floating point capability
    Denorms:                     No
    Quiet NaNs:                     Yes
    Round to nearest even:             Yes
    Round to zero:                 Yes
    Round to +ve and infinity:             Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                     None
  Cache line size:                 0
  Cache size:                     0
  Global memory size:                 536870912
  Constant buffer size:                 65536
  Max number of constant args:             8
  Local memory type:                 Scratchpad
  Local memory size:                 32768
  Kernel Preferred work group size multiple:     64
  Error correction support:             0
  Unified memory for Host and Device:         0
  Profiling timer resolution:             1
  Device endianess:                 Little
  Available:                     Yes
  Compiler available:                 Yes
  Execution capabilities:                 
    Execute OpenCL kernels:             Yes
    Execute native function:             No
  Queue properties:                 
    Out-of-Order:                 No
    Profiling :                     Yes
  Platform ID:                     0x7fca63c79880
  Name:                         Cypress
  Vendor:                     Advanced Micro Devices, Inc.
  Driver version:                 CAL 1.4.900
  Profile:                     FULL_PROFILE
  Version:                     OpenCL 1.1 ATI-Stream-v2.3 (451)
  Extensions:                     cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt


  Device Type:                     CL_DEVICE_TYPE_CPU
  Device ID:                     4098
  Max compute units:                 4
  Max work items dimensions:             3
    Max work items[0]:                 1024
    Max work items[1]:                 1024
    Max work items[2]:                 1024
  Max work group size:                 1024
  Preferred vector width char:             16
  Preferred vector width short:             8
  Preferred vector width int:             4
  Preferred vector width long:             2
  Preferred vector width float:             4
  Preferred vector width double:         0
  Native vector width char:             16
  Native vector width short:             8
  Native vector width int:             4
  Native vector width long:             2
  Native vector width float:             4
  Native vector width double:             0
  Max clock frequency:                 2400Mhz
  Address bits:                     64
  Max memory allocation:             1073741824
  Image support:                 No
  Max size of kernel argument:             4096
  Alignment (bits) of base address:         1024
  Minimum alignment (bytes) for any datatype:     128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                     Yes
    Round to nearest even:             Yes
    Round to zero:                 Yes
    Round to +ve and infinity:             Yes
    IEEE754-2008 fused multiply-add:         No
  Cache type:                     Read/Write
  Cache line size:                 64
  Cache size:                     32768
  Global memory size:                 3221225472
  Constant buffer size:                 65536
  Max number of constant args:             8
  Local memory type:                 Global
  Local memory size:                 32768
  Kernel Preferred work group size multiple:     1
  Error correction support:             0
  Unified memory for Host and Device:         1
  Profiling timer resolution:             1
  Device endianess:                 Little
  Available:                     Yes
  Compiler available:                 Yes
  Execution capabilities:                 
    Execute OpenCL kernels:             Yes
    Execute native function:             Yes
  Queue properties:                 
    Out-of-Order:                 No
    Profiling :                     Yes
  Platform ID:                     0x7fca63c79880
  Name:                         Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
  Vendor:                     GenuineIntel
  Driver version:                 2.0
  Profile:                     FULL_PROFILE
  Version:                     OpenCL 1.1 ATI-Stream-v2.3 (451)
  Extensions:                     cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_media_ops cl_amd_popcnt cl_amd_printf

0 Likes

Here is a small test app in C++ to reproduce it:

It hangs at the calclCompile call.

#include <time.h>
#include <cal.h>
#include <calcl.h>
#include <cal_ext.h>
#include <stdio.h>
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
#include <sstream>

using namespace std;

template <class T>
inline std::string to_string (const T& t)
{
std::stringstream ss;
ss << t;
return ss.str();
}

inline const string operator+(const string & s, int i)
{
    return s + to_string(i);
}

string devicenames[] = {"R600", "RV610", "RV630", "RV670", "R700", "RV770", "RV710", "RV730",
                        "CYPRESS", "JUNIPER", "REDWOOD", "CEDAR", "RESERVED0", "RESERVED1",
                        "WRESTLER", "CAYMAN", "RESERVED2", "BARTS"};

class MyException
{
    public:
    MyException(string msg)
    {
        message = msg;
    }
    const string GetMessage()
    {
        return message;
    }

    private:
    string message;
};

void logDisassemble(const char* msg)
{
    printf("%s\n", msg);
}

#define CHECK_CAL(A) checkCalCall(A, #A)

void checkCalCall(int result, string op)
{
    printf("%s: %d\n", op.c_str(), result);
    if(result != CAL_RESULT_OK)
    {
        throw MyException(op + ": " + result);
    }
}

int go(int argc, char ** argv);

int main(int argc, char ** argv)
{
    try
    {
        go(argc, argv);
    }
    catch(MyException x)
    {
        cout<<"Exception: "<<x.GetMessage()<<endl;
        return 1;
    }
    cout<<"OK"<<endl;
    return 0;
}

int go(int argc, char ** argv)
{

    int inlen = 8;
    CALuint constants[8] = {0, 0, 0, 0, 0, 0, 0, 0};

    string kernelsrc = "";
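    string line;
    int count = 0;
    // Read the IL kernel source from stdin, one line at a time.
    // Testing the result of getline avoids appending a spurious extra line at EOF.
    while (getline(cin, line))
    {
        kernelsrc = kernelsrc + line + "\n";
    }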

    cout<<kernelsrc;

    CHECK_CAL(calInit());

    CALuint version[3];
    CHECK_CAL(calGetVersion(&version[0], &version[1], &version[2]));
    printf("CAL Runtime version %d.%d.%d\n", version[0], version[1], version[2]);

    CALuint clversion[3];
    CHECK_CAL(calclGetVersion(&clversion[0], &clversion[1], &clversion[2]));
    printf("CAL Compiler     version %d.%d.%d\n", clversion[0], clversion[1], clversion[2]);

    CALuint numDevices = 0;
    CHECK_CAL(calDeviceGetCount(&numDevices));

    cout<<"device count: "<<numDevices<<endl;

    if (numDevices < 1)
        throw MyException("no devices");

    CALdeviceinfo info;
    CHECK_CAL(calDeviceGetInfo(&info, 0));

    CALobject object = NULL;
    CALimage image = NULL;
    CHECK_CAL(calclCompile(&object, CAL_LANGUAGE_IL, kernelsrc.c_str(), info.target));
    CHECK_CAL(calclLink(&image, &object, 1));

    printf("///////////////////////////////////////////////////////////////////////\n");
    calclDisassembleObject(&object, &logDisassemble);
    printf("///////////////////////////////////////////////////////////////////////\n");

    CALdevice device = 0;
    CHECK_CAL(calDeviceOpen(&device, 0));  // this is the device number

    CALcontext ctx;
    CHECK_CAL(calCtxCreate(&ctx, device));

    CALresource output1Res = 0;
    CHECK_CAL(calResAllocLocal2D(&output1Res, device, 256, 1, CAL_FORMAT_UNSIGNED_INT32_4, CAL_RESALLOC_GLOBAL_BUFFER));

    CALresource constRes = 0;
    CHECK_CAL(calResAllocLocal1D(&constRes, device, inlen, CAL_FORMAT_FLOAT32_1, 0));

    CALuint* constPtr = NULL;
    CALuint constPitch = 0;
    CALmem constMem = 0;
    CHECK_CAL(calResMap((CALvoid**)&constPtr, &constPitch, constRes, 0));
    // Copy the host-side constants into the mapped constant buffer
    for (int i = 0; i < inlen; i++) {
        constPtr[i] = constants[i];
    }

    CHECK_CAL(calResUnmap(constRes));
    // Mapping output resource to CPU and initializing values
    void* data1 = NULL;
    // Getting memory handle from resources
    CALmem output1Mem = 0;
    CALuint pitch1 = 0;
    CHECK_CAL(calResMap(&data1, &pitch1, output1Res, 0));
    memset(data1, 0, 256 * sizeof(CALuint) * 4);
    CHECK_CAL(calResUnmap(output1Res));
    // Get memory handles for various resources
    CHECK_CAL(calCtxGetMem(&constMem, ctx, constRes));
    CHECK_CAL(calCtxGetMem(&output1Mem, ctx, output1Res));

    // Creating module using compiled image
    CALmodule module = 0;
    CHECK_CAL(calModuleLoad(&module, ctx, image));
    // Defining symbols in module
    CALfunc func = 0;
    CALname out1Name = 0;
    CALname constName = 0;
    // Defining entry point for the module
    CHECK_CAL(calModuleGetEntry(&func, ctx, module, "main"));
    CHECK_CAL(calModuleGetName(&out1Name, ctx, module, "g[]"));
    CHECK_CAL(calModuleGetName(&constName, ctx, module, "cb0"));
    // Setting input and output buffers
    // used in the kernel
    CHECK_CAL(calCtxSetMem(ctx, out1Name, output1Mem));
    CHECK_CAL(calCtxSetMem(ctx, constName, constMem));
    // Setting domain

    // do kernel calc

    //-----------------------------------------------------------------
    // Executing kernel and waiting for kernel to terminate
    //-----------------------------------------------------------------
    // Event to check completion of the kernel
    CALevent e = 0;
    CALprogramGrid pg;
    pg.func = func;
    pg.gridBlock.width = 64;
    pg.gridBlock.height = 1;
    pg.gridBlock.depth = 1;
    pg.gridSize.width = 4;
    pg.gridSize.height = 1;
    pg.gridSize.depth = 1;
    pg.flags = 0;
    CHECK_CAL(calCtxRunProgramGrid(&e, ctx, &pg));

    // Checking whether the execution of the kernel is complete or not
    while (calCtxIsEventDone(ctx, e) == CAL_RESULT_PENDING);
    // Reading output from output resources
    CALuint *fdata;

    cout<<"-- FEED BEGIN --"<<endl;

    CHECK_CAL(calResMap((CALvoid**)&fdata, &pitch1, output1Res, 0));
    // 256 elements of 4 unsigned ints each = 1024 values
    for (int i = 0; i < 1024; ++i)
    {
        printf("%u\n", fdata[i]);
    }
    cout<<"-- FEED END --"<<endl;

    CHECK_CAL(calResUnmap(output1Res));

    // end

    // Unloading the module
    CHECK_CAL(calModuleUnload(ctx, module));
    // Freeing compiled kernel binary
    CHECK_CAL(calclFreeImage(image));
    CHECK_CAL(calclFreeObject(object));
    // Releasing resource from context
    CHECK_CAL(calCtxReleaseMem(ctx, output1Mem));

    // Deallocating resources
    CHECK_CAL(calResFree(output1Res));

    CHECK_CAL(calCtxDestroy(ctx));

    CHECK_CAL(calDeviceClose(device));

    CHECK_CAL(calShutdown());

    return 0;
}

0 Likes

Anyway, I can easily change the algorithm to compute the same desired value successfully, but the compiler shouldn't hang with this.

0 Likes

Hi, I just noticed that the IL code I pasted originally had the constant value 0x1000001, which makes it compile OK. It is with the value 0x1000000 that the compiler hangs:

 

il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[2]
dcl_literal l0, 5, 10, 0x100, 100
dcl_literal l1, 4, 3, 2, 1
dcl_literal l2, 0x10000, 1000, 0x1000000, 0
mov r0.x, vaTid.x
mov r0.y, vThreadGrpId.x
mov r0.z, vTidInGrp.x
umul r0.w, r0.x, l[0].x
iadd r1, r0.wwww, l[1]
iadd r2, cb0[0].xxxx, r1
iadd r3.x, cb0[0].x, r0.w
umod r4, r2, l[0].yyyy
umod r3.y, r3.x, l[0].y
udiv r5, r2, l[0].yyyy
udiv r3.z, r3.x, l[0].y
umod r6, r5, l[0].yyyy
umod r3.w, r3.z, l[0].y
umul r5, r6, l[0].zzzz
umul r3.z, r3.w, l[0].z
iadd r6, r4, r5
iadd r3.w, r3.y, r3.z
udiv r4, r2, l[0].wwww
udiv r3.y, r3.x, l[0].w
umod r5, r4, l[0].yyyy
umod r3.z, r3.y, l[0].y
umul r4, r5, l[2].xxxx
umul r3.y, r3.z, l[2].x
iadd r5, r6, r4
iadd r3.z, r3.w, r3.y
udiv r4, r2, l[2].yyyy
udiv r3.y, r3.x, l[2].y
umod r6, r4, l[0].yyyy
umod r3.w, r3.y, l[0].y
umul r4, r6, l[2].zzzz
umul r3.y, r3.w, l[2].z
iadd r6, r5, r4
iadd r3.w, r3.z, r3.y
mov g[r0.x], r6
end
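
For anyone who wants to check the GPU output once the kernel compiles, here is a rough CPU sketch of what the live part of the IL above seems to compute per work-item (the r3 path is ignored). The function name referenceKernel and its parameters are made up for illustration: base stands in for cb0[0].x, tid for vaTid.x, and mulConst for l2.z. It is a reading of the listing, not something taken from the CAL SDK.

```cpp
#include <array>
#include <cstdint>

// Hypothetical CPU reference for one work-item of the IL kernel above.
// base     : stands in for cb0[0].x
// tid      : stands in for vaTid.x
// mulConst : stands in for l2.z (0x1000000 in the hanging version)
std::array<std::uint32_t, 4> referenceKernel(std::uint32_t base,
                                             std::uint32_t tid,
                                             std::uint32_t mulConst = 0x1000000u)
{
    const std::uint32_t offsets[4] = {4u, 3u, 2u, 1u};  // literal l1
    std::array<std::uint32_t, 4> out{};
    for (int c = 0; c < 4; ++c) {
        // r2 = cb0[0].xxxx + (vaTid.x * 5 + l1)
        const std::uint32_t v = base + tid * 5u + offsets[c];
        // Pack the four lowest decimal digits of v into separate bytes.
        out[c] = (v % 10u)
               + ((v / 10u) % 10u) * 0x100u
               + ((v / 100u) % 10u) * 0x10000u
               + ((v / 1000u) % 10u) * mulConst;
    }
    return out;  // corresponds to r6, which is stored to g[r0.x]
}
```

With base = 1230 and tid = 0, the .x component works on the value 1234 and packs its digits one per byte.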

0 Likes

Hmm, very curious.

If I add a new literal:

dcl_literal l3, 16, 1000, 24, 0

and change:

umul r4, r6, l[2].zzzz

into:

ishl r4, r6, l3.zzzz

The compiler hangs too.

I noticed that all r3 computations are dead code.

Also I noticed that only the .w component of:

umul r4, r6, l[2].zzzz

is being computed. This is puzzling me at the moment. Overall, for whatever reason, I can come up with a variety of ways of hanging the IL compiler based on your code. None of them are your fault.

If, on the other hand, I try the attached code, compilation is fine. Note that this code "fixes" the final addition. This is a real mess. Again, not your fault.

All my comments are based on testing with SKA 1.7.

il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[2]
dcl_literal l0, 5, 10, 0x100, 100
dcl_literal l1, 4, 3, 2, 1
dcl_literal l2, 0x10000, 1000, 0x1000000, 0
mov r0.x, vaTid.x
mov r0.y, vThreadGrpId.x
mov r0.z, vTidInGrp.x
umul r0.w, r0.x, l[0].x
iadd r1, r0.wwww, l[1]
iadd r2, cb0[0].xxxx, r1
umod r4, r2, l[0].yyyy
udiv r5, r2, l[0].yyyy
umod r6, r5, l[0].yyyy
umul r5, r6, l[0].zzzz
iadd r6, r4, r5
udiv r4, r2, l[0].wwww
umod r5, r4, l[0].yyyy
umul r4, r5, l[2].xxxx
iadd r5, r6, r4
udiv r4, r2, l[2].yyyy
umod r6, r4, l[0].yyyy
umul r4, r6, l[2].zzzz
//iadd r6, r5, r4
iadd r6.x, r5.x, r4.x
iadd r6.y, r5.y, r4.y
iadd r6.z, r5.z, r4.z
iadd r6.w, r5.w, r4.w/**/
mov g[r0.x], r6
end

0 Likes

Thanks Jawed,

Well, the code that I attached is actually only the beginning of a larger program, which is why it has some dead code: it was used by code I deleted. I wanted to attach something small that still hangs the compiler rather than 1000+ lines of code.

And yes, doing a shift was my first alternative, but it also hangs.

Thanks for your code. I am actually generating the IL from a high-level compiler, so I have to emit something that, once translated to IL, won't hang the CAL compiler. It's really frustrating; I have hung the CAL compiler with several different programs.

Since the troubling line multiplies a variable by 0x1000000, the simplest solution I could think of is to multiply by two constants that sum to 0x1000000 (such as 0xFFFFFF and 1; with 1 I can save a mul), and then add the parts.

It seems that the CAL compiler is trying to do some optimization because the constant 0x1000000 is 1 << 24, and it hangs in the process. And since I am doing some udiv in between (which the CAL compiler cannot translate simply to the final ISA code), I guess that doesn't help.
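
To make the arithmetic behind this workaround concrete, here is a small C++ sketch of the three equivalent forms under 32-bit unsigned wraparound. The function names are made up for illustration, and this only shows that the results match on the CPU; whether the split form actually avoids the hang in the CAL compiler is not something this code can confirm.

```cpp
#include <cstdint>

// Direct form: the multiplication by the literal 0x1000000.
inline std::uint32_t mulDirect(std::uint32_t x) { return x * 0x1000000u; }

// Split form: 0xFFFFFF + 1 == 0x1000000, so x * 0xFFFFFF + x gives the same
// result modulo 2^32 (unsigned arithmetic wraps, so overflow is well defined).
inline std::uint32_t mulSplit(std::uint32_t x) { return x * 0xFFFFFFu + x; }

// Shift form: 0x1000000 == 1 << 24.
inline std::uint32_t mulShift(std::uint32_t x) { return x << 24; }
```

All three agree for every 32-bit input, including values where the product overflows, so the split only changes what the compiler sees, not the result.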

0 Likes

Are you using CAL++? If not, perhaps it's worth trying. I have never used it, but it might help you get around some gotchas.

As for the IL compiler being stupid, the only thing I can suggest as a short-term work-around is to use a constant buffer entry for the constant rather than a literal. That way the IL compiler can't do any optimisation. But that depends on your high level tool.

It could be that two or more literals are interacting and the IL compiler is oscillating amongst them in some bizarre evaluation of what is "best".

0 Likes

I haven't tried CAL++. I am using my own compiler because I wanted more control over the code generation and some tweaks, and it is just fun. But I'll try CAL++ and the other tools I have been finding around; they look interesting.

And yes, I am pushing constants. Normally I pass a zero in the constant buffer and add it to itself into a register, and then add that register to anything I don't want optimized. The CAL compiler does shuffle things around anyway, and normally it does a good job.

0 Likes