Archives Discussions

llongeri · ‎03-09-2011

CAL compiler hangs depending on constant value in a UMUL instruction

Hi,

I have come up with this IL code that hangs the CAL compiler with 100% CPU usage (ati-stream-sdk-v2.3).

The code is a simple computation and uses some constants.

The compilation seems to go fine up to the last 2 umul instructions, where a register is multiplied by the constant 0x1000000 (l[2].z). The thing is, if I change the constant to some other value, such as 0x1000001, the compiler works fine.

il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[2]
dcl_literal l0, 5, 10, 0x100, 100
dcl_literal l1, 4, 3, 2, 1
dcl_literal l2, 0x10000, 1000, 0x1000001, 0
mov r0.x, vaTid.x
mov r0.y, vThreadGrpId.x
mov r0.z, vTidInGrp.x
umul r0.w, r0.x, l[0].x
iadd r1, r0.wwww, l[1]
iadd r2, cb0[0].xxxx, r1
iadd r3.x, cb0[0].x, r0.w
umod r4, r2, l[0].yyyy
umod r3.y, r3.x, l[0].y
udiv r5, r2, l[0].yyyy
udiv r3.z, r3.x, l[0].y
umod r6, r5, l[0].yyyy
umod r3.w, r3.z, l[0].y
umul r5, r6, l[0].zzzz
umul r3.z, r3.w, l[0].z
iadd r6, r4, r5
iadd r3.w, r3.y, r3.z
udiv r4, r2, l[0].wwww
udiv r3.y, r3.x, l[0].w
umod r5, r4, l[0].yyyy
umod r3.z, r3.y, l[0].y
umul r4, r5, l[2].xxxx
umul r3.y, r3.z, l[2].x
iadd r5, r6, r4
iadd r3.z, r3.w, r3.y
udiv r4, r2, l[2].yyyy
udiv r3.y, r3.x, l[2].y
umod r6, r4, l[0].yyyy
umod r3.w, r3.y, l[0].y
umul r4, r6, l[2].zzzz
umul r3.y, r3.w, l[2].z
iadd r6, r5, r4
iadd r3.w, r3.z, r3.y
mov g[r0.x], r6
end

MicahVillmow · ‎03-09-2011

llongeri,
I don't see this issue with our upcoming SDK release with our internal tools. Do you have a small test app that we can use to attempt to reproduce it? Also, can you post the output of CLInfo.exe here so we know what the system setup is?

llongeri · ‎03-09-2011

This is the CLInfo output:

Number of platforms:                1
Platform Profile:                FULL_PROFILE
Platform Version:                OpenCL 1.1 ATI-Stream-v2.3 (451)
Platform Name:                ATI Stream
Platform Vendor:                Advanced Micro Devices, Inc.
Platform Extensions:                cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name:                ATI Stream
Number of devices:                2
Device Type:                    CL_DEVICE_TYPE_GPU
Device ID:                    4098
Max compute units:                18
Max work items dimensions:            3
    Max work items[0]:                256
    Max work items[1]:                256
    Max work items[2]:                256
Max work group size:                256
Preferred vector width char:            16
Preferred vector width short:            8
Preferred vector width int:            4
Preferred vector width long:            2
Preferred vector width float:            4
Preferred vector width double:        0
Native vector width char:            0
Native vector width short:            0
Native vector width int:            0
Native vector width long:            0
Native vector width float:            0
Native vector width double:            0
Max clock frequency:                0Mhz
Address bits:                    32
Max memory allocation:            134217728
Image support:                Yes
Max number of images read arguments:        128
Max number of images write arguments:        8
Max image 2D width:                8192
Max image 2D height:                8192
Max image 3D width:                2048
Max image 3D height:                2048
Max image 3D depth:                2048
Max samplers within kernel:            16
Max size of kernel argument:            1024
Alignment (bits) of base address:        32768
Minimum alignment (bytes) for any datatype:    128
Single precision floating point capability
    Denorms:                    No
    Quiet NaNs:                    Yes
    Round to nearest even:            Yes
    Round to zero:                Yes
    Round to +ve and infinity:            Yes
    IEEE754-2008 fused multiply-add:        Yes
Cache type:                    None
Cache line size:                0
Cache size:                    0
Global memory size:                536870912
Constant buffer size:                65536
Max number of constant args:            8
Local memory type:                Scratchpad
Local memory size:                32768
Kernel Preferred work group size multiple:    64
Error correction support:            0
Unified memory for Host and Device:        0
Profiling timer resolution:            1
Device endianess:                Little
Available:                    Yes
Compiler available:                Yes
Execution capabilities:
    Execute OpenCL kernels:            Yes
    Execute native function:            No
Queue properties:
    Out-of-Order:                No
    Profiling :                    Yes
Platform ID:                    0x7fca63c79880
Name:                        Cypress
Vendor:                    Advanced Micro Devices, Inc.
Driver version:                CAL 1.4.900
Profile:                    FULL_PROFILE
Version:                    OpenCL 1.1 ATI-Stream-v2.3 (451)
Extensions:                    cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt

Device Type:                    CL_DEVICE_TYPE_CPU
Device ID:                    4098
Max compute units:                4
Max work items dimensions:            3
    Max work items[0]:                1024
    Max work items[1]:                1024
    Max work items[2]:                1024
Max work group size:                1024
Preferred vector width char:            16
Preferred vector width short:            8
Preferred vector width int:            4
Preferred vector width long:            2
Preferred vector width float:            4
Preferred vector width double:        0
Native vector width char:            16
Native vector width short:            8
Native vector width int:            4
Native vector width long:            2
Native vector width float:            4
Native vector width double:            0
Max clock frequency:                2400Mhz
Address bits:                    64
Max memory allocation:            1073741824
Image support:                No
Max size of kernel argument:            4096
Alignment (bits) of base address:        1024
Minimum alignment (bytes) for any datatype:    128
Single precision floating point capability
    Denorms:                    Yes
    Quiet NaNs:                    Yes
    Round to nearest even:            Yes
    Round to zero:                Yes
    Round to +ve and infinity:            Yes
    IEEE754-2008 fused multiply-add:        No
Cache type:                    Read/Write
Cache line size:                64
Cache size:                    32768
Global memory size:                3221225472
Constant buffer size:                65536
Max number of constant args:            8
Local memory type:                Global
Local memory size:                32768
Kernel Preferred work group size multiple:    1
Error correction support:            0
Unified memory for Host and Device:        1
Profiling timer resolution:            1
Device endianess:                Little
Available:                    Yes
Compiler available:                Yes
Execution capabilities:
    Execute OpenCL kernels:            Yes
    Execute native function:            Yes
Queue properties:
    Out-of-Order:                No
    Profiling :                    Yes
Platform ID:                    0x7fca63c79880
Name:                        Intel(R) Core(TM)2 Quad CPU    Q6600 @ 2.40GHz
Vendor:                    GenuineIntel
Driver version:                2.0
Profile:                    FULL_PROFILE
Version:                    OpenCL 1.1 ATI-Stream-v2.3 (451)
Extensions:                    cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_media_ops cl_amd_popcnt cl_amd_printf

llongeri · ‎03-09-2011

Here is a small test app in C++ to reproduce it:

It hangs at the calclCompile call.

#include <time.h>
#include <cal.h>
#include <calcl.h>
#include <cal_ext.h>
#include <stdio.h>
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
#include <sstream>

using namespace std;

template <class T>
inline std::string to_string (const T& t)
{
std::stringstream ss;
ss << t;
return ss.str();
}

inline const string operator+(const string & s, int i)
{
    return s + to_string(i);
}

string devicenames[] = {"R600", "RV610", "RV630", "RV670", "R700", "RV770", "RV710", "RV730",
                        "CYPRESS", "JUNIPER", "REDWOOD", "CEDAR", "RESERVED0", "RESERVED1",
                        "WRESTLER", "CAYMAN", "RESERVED2", "BARTS"};

class MyException
{
    public:
    MyException(string msg)
    {
        message = msg;
    }
    const string GetMessage()
    {
        return message;
    }

    private:
    string message;
};

void logDisassemble(const char* msg)
{
    printf("%s\n", msg);
}

#define CHECK_CAL(A) checkCalCall(A, #A)

void checkCalCall(int result, string op)
{
    printf("%s: %d\n", op.c_str(), result);
    if(result != CAL_RESULT_OK)
    {
        throw MyException(op + ": " + result);
    }
}

int go(int argc, char ** argv);

int main(int argc, char ** argv)
{
    try
    {
        go(argc, argv);
    }
    catch(MyException x)
    {
        cout<<"Exception: "<<x.GetMessage()<<endl;
        return 1;
    }
    cout<<"OK"<<endl;
    return 0;
}

int go(int argc, char ** argv)
{

    int inlen = 8;
    CALuint constants[8] = {0, 0, 0, 0, 0, 0, 0, 0};

    string kernelsrc = "";
    char line[1024];
    int count = 0;
    while (!cin.eof())
    {
        cin.getline(line, 1024);
        kernelsrc = kernelsrc + line + "\n";
    }

    cout<<kernelsrc;

    CHECK_CAL(calInit());

    CALuint version[3];
    CHECK_CAL(calGetVersion(&version[0], &version[1], &version[2]));
    printf("CAL Runtime version %d.%d.%d\n", version[0], version[1], version[2]);

    CALuint clversion[3];
    CHECK_CAL(calclGetVersion(&clversion[0], &clversion[1], &clversion[2]));
    printf("CAL Compiler    version %d.%d.%d\n", clversion[0], clversion[1], clversion[2]);

    CALuint numDevices = 0;
    CHECK_CAL(calDeviceGetCount(&numDevices));

    cout<<"device count: "<<numDevices<<endl;

    if (numDevices < 1)
        throw MyException("no devices");

    CALdeviceinfo info;
    CHECK_CAL(calDeviceGetInfo(&info, 0));

    CALobject object = NULL;
    CALimage image = NULL;
    CHECK_CAL(calclCompile(&object, CAL_LANGUAGE_IL, kernelsrc.c_str(), info.target));
    CHECK_CAL(calclLink(&image, &object, 1));

    printf("///////////////////////////////////////////////////////////////////////\n");
    calclDisassembleObject(&object, &logDisassemble);
    printf("///////////////////////////////////////////////////////////////////////\n");

    CALdevice device = 0;
    CHECK_CAL(calDeviceOpen(&device, 0)); // this is the device number

    CALcontext ctx;
    CHECK_CAL(calCtxCreate(&ctx, device));

    CALresource output1Res = 0;
    CHECK_CAL(calResAllocLocal2D(&output1Res, device, 256, 1, CAL_FORMAT_UNSIGNED_INT32_4, CAL_RESALLOC_GLOBAL_BUFFER));

    CALresource constRes = 0;
    CHECK_CAL(calResAllocLocal1D(&constRes, device, inlen, CAL_FORMAT_FLOAT32_1, 0));

    CALuint* constPtr = NULL;
    CALuint constPitch = 0;
    CALmem constMem = 0;
    CHECK_CAL(calResMap((CALvoid**)&constPtr, &constPitch, constRes, 0));
    // Copy the host-side constants into the mapped constant buffer
    for (int i = 0; i < inlen; i++) {
        constPtr[i] = constants[i];
    }

    CHECK_CAL(calResUnmap(constRes));
    // Mapping output resource to CPU and initializing values
    void* data1 = NULL;
    // Getting memory handle from resources
    CALmem output1Mem = 0;
    CALuint pitch1 = 0;
    CHECK_CAL(calResMap(&data1, &pitch1, output1Res, 0));
    memset(data1, 0, 256 * sizeof(CALuint) * 4);
    CHECK_CAL(calResUnmap(output1Res));
    // Get memory handles for various resources
    CHECK_CAL(calCtxGetMem(&constMem, ctx, constRes));
    CHECK_CAL(calCtxGetMem(&output1Mem, ctx, output1Res));

    // Creating module using compiled image
    CALmodule module = 0;
    CHECK_CAL(calModuleLoad(&module, ctx, image));
    // Defining symbols in module
    CALfunc func = 0;
    CALname out1Name = 0;
    CALname constName = 0;
    // Defining entry point for the module
    CHECK_CAL(calModuleGetEntry(&func, ctx, module, "main"));
    CHECK_CAL(calModuleGetName(&out1Name, ctx, module, "g[]"));
    CHECK_CAL(calModuleGetName(&constName, ctx, module, "cb0"));
    // Setting input and output buffers
    // used in the kernel
    CHECK_CAL(calCtxSetMem(ctx, out1Name, output1Mem));
    CHECK_CAL(calCtxSetMem(ctx, constName, constMem));
    // Setting domain

    // do kernel calc

    //-----------------------------------------------------------------
    // Executing kernel and waiting for kernel to terminate
    //-----------------------------------------------------------------
    // Event to check completion of the kernel
    CALevent e = 0;
    CALprogramGrid pg;
    pg.func = func;
    pg.gridBlock.width = 64;
    pg.gridBlock.height = 1;
    pg.gridBlock.depth = 1;
    pg.gridSize.width = 4;
    pg.gridSize.height = 1;
    pg.gridSize.depth = 1;
    pg.flags = 0;
    CHECK_CAL(calCtxRunProgramGrid(&e, ctx, &pg));

    // Checking whether the execution of the kernel is complete or not
    while (calCtxIsEventDone(ctx, e) == CAL_RESULT_PENDING);
    // Reading output from output resources
    CALuint *fdata = NULL;

    cout<<"-- FEED BEGIN --"<<endl;

    CHECK_CAL(calResMap((CALvoid**)&fdata, &pitch1, output1Res, 0));
    // 256 elements of 4 components each
    for (int i = 0; i < 1024; ++i)
    {
        printf("%u\n", fdata[i]);
    }
    cout<<"-- FEED END --"<<endl;

    CHECK_CAL(calResUnmap(output1Res));

    // end

    // Unloading the module
    CHECK_CAL(calModuleUnload(ctx, module));
    // Freeing compiled kernel binary
    CHECK_CAL(calclFreeImage(image));
    CHECK_CAL(calclFreeObject(object));
    // Releasing resources from context
    CHECK_CAL(calCtxReleaseMem(ctx, output1Mem));
    CHECK_CAL(calCtxReleaseMem(ctx, constMem));

    // Deallocating resources
    CHECK_CAL(calResFree(output1Res));
    CHECK_CAL(calResFree(constRes));

    CHECK_CAL(calCtxDestroy(ctx));

    CHECK_CAL(calDeviceClose(device));

    CHECK_CAL(calShutdown());

    return 0;
}

llongeri · ‎03-10-2011

Anyway, I can easily change the algorithm to compute the same desired value successfully, but the compiler shouldn't hang with this.

llongeri · ‎03-11-2011

Hi, I just noticed that the IL code I pasted originally had the constant value 0x1000001, which makes it compile OK. It is with the value 0x1000000 that compilation hangs:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[2]
dcl_literal l0, 5, 10, 0x100, 100
dcl_literal l1, 4, 3, 2, 1
dcl_literal l2, 0x10000, 1000, 0x1000000, 0
mov r0.x, vaTid.x
mov r0.y, vThreadGrpId.x
mov r0.z, vTidInGrp.x
umul r0.w, r0.x, l[0].x
iadd r1, r0.wwww, l[1]
iadd r2, cb0[0].xxxx, r1
iadd r3.x, cb0[0].x, r0.w
umod r4, r2, l[0].yyyy
umod r3.y, r3.x, l[0].y
udiv r5, r2, l[0].yyyy
udiv r3.z, r3.x, l[0].y
umod r6, r5, l[0].yyyy
umod r3.w, r3.z, l[0].y
umul r5, r6, l[0].zzzz
umul r3.z, r3.w, l[0].z
iadd r6, r4, r5
iadd r3.w, r3.y, r3.z
udiv r4, r2, l[0].wwww
udiv r3.y, r3.x, l[0].w
umod r5, r4, l[0].yyyy
umod r3.z, r3.y, l[0].y
umul r4, r5, l[2].xxxx
umul r3.y, r3.z, l[2].x
iadd r5, r6, r4
iadd r3.z, r3.w, r3.y
udiv r4, r2, l[2].yyyy
udiv r3.y, r3.x, l[2].y
umod r6, r4, l[0].yyyy
umod r3.w, r3.y, l[0].y
umul r4, r6, l[2].zzzz
umul r3.y, r3.w, l[2].z
iadd r6, r5, r4
iadd r3.w, r3.z, r3.y
mov g[r0.x], r6
end
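
For readers who don't want to trace the IL by hand, here is a rough C++ reading of what each thread appears to compute. It is only a sketch: the names packDecimalDigits and kernelLane are made up for illustration, and it assumes udiv, umod, and umul behave like C++ unsigned /, %, and * on 32-bit values. The dead r3 computations are left out.

```cpp
#include <cstdint>

// One component of r6: take the four low decimal digits of n and place
// each in its own byte (multipliers 1, 0x100, 0x10000, 0x1000000).
constexpr std::uint32_t packDecimalDigits(std::uint32_t n)
{
    return  (n % 10u)
          + ((n / 10u)   % 10u) * 0x100u
          + ((n / 100u)  % 10u) * 0x10000u
          + ((n / 1000u) % 10u) * 0x1000000u;  // the multiply that triggers the hang
}

// Component `lane` (0 = x ... 3 = w) of the value thread `tid` stores to g[tid].
// `base` stands for cb0[0].x; l0.x = 5 and l1 = (4, 3, 2, 1) come from the IL.
constexpr std::uint32_t kernelLane(std::uint32_t tid, std::uint32_t base,
                                   std::uint32_t lane)
{
    return packDecimalDigits(base + tid * 5u + (4u - lane));
}
```

For example, packDecimalDigits(1234) gives 0x01020304 under these assumptions.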

Jawed · ‎03-15-2011

Hmm, very curious.

If I add a new literal:

dcl_literal l3, 16, 1000, 24, 0

and change:

umul r4, r6, l[2].zzzz

into:

ishl r4, r6, l3.zzzz

The compiler hangs too.

I noticed that all r3 computations are dead code.

Also I noticed that only the .w component of:

umul r4, r6, l[2].zzzz

is being computed. This is puzzling me at the moment. Overall, for whatever reason, I can come up with a variety of ways of hanging the IL compiler based on your code. None of them are your fault.

If, on the other hand, I try the attached code, compilation is fine. Note that this code "fixes" the final addition. This is a real mess. Again, not your fault.

All my comments are based on testing with SKA 1.7.

il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[2]
dcl_literal l0, 5, 10, 0x100, 100
dcl_literal l1, 4, 3, 2, 1
dcl_literal l2, 0x10000, 1000, 0x1000000, 0
mov r0.x, vaTid.x
mov r0.y, vThreadGrpId.x
mov r0.z, vTidInGrp.x
umul r0.w, r0.x, l[0].x
iadd r1, r0.wwww, l[1]
iadd r2, cb0[0].xxxx, r1
umod r4, r2, l[0].yyyy
udiv r5, r2, l[0].yyyy
umod r6, r5, l[0].yyyy
umul r5, r6, l[0].zzzz
iadd r6, r4, r5
udiv r4, r2, l[0].wwww
umod r5, r4, l[0].yyyy
umul r4, r5, l[2].xxxx
iadd r5, r6, r4
udiv r4, r2, l[2].yyyy
umod r6, r4, l[0].yyyy
umul r4, r6, l[2].zzzz
//iadd r6, r5, r4
iadd r6.x, r5.x, r4.x
iadd r6.y, r5.y, r4.y
iadd r6.z, r5.z, r4.z
iadd r6.w, r5.w, r4.w/**/
mov g[r0.x], r6
end

llongeri · ‎03-19-2011

Thanks Jawed,

Well, the code that I attached is actually only the beginning of a larger program. That is why it has some dead code: it was used by code I deleted. I wanted to attach something small that still hangs the compiler rather than 1000+ lines of code.

And yes, doing a shift was my first alternative, but it also hangs.

Thanks for your code. I am actually generating the IL from a high-level compiler, so I have to emit something that, once translated to IL, won't hang the CAL compiler. It's really frustrating; I have hung the CAL compiler with several different kernels.

Since the troublesome line multiplies a variable by 0x1000000, the simplest solution I could think of is to multiply by two constants that sum to 0x1000000 (such as 0xFFFFFF and 1; with 1 I can skip the multiplication) and then add the partial products.
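
A quick host-side check of that identity, as a sketch only. The function names are made up, and it assumes umul keeps the low 32 bits of the product, like C++ unsigned multiplication:

```cpp
#include <cstdint>

// The multiplication as written in the IL: x * 0x1000000 (mod 2^32).
constexpr std::uint32_t mulDirect(std::uint32_t x)
{
    return x * 0x1000000u;
}

// The split form: 0x1000000 == 0xFFFFFF + 1, so
// x * 0x1000000 == x * 0xFFFFFF + x * 1, and the "* 1" needs no umul.
constexpr std::uint32_t mulSplit(std::uint32_t x)
{
    return x * 0xFFFFFFu + x;
}

// Unsigned arithmetic wraps modulo 2^32, so both forms agree
// even when the product overflows.
static_assert(mulSplit(9u) == mulDirect(9u), "small value");
static_assert(mulSplit(0xFFFFFFFFu) == mulDirect(0xFFFFFFFFu), "overflowing value");
```

In IL this would presumably become one umul by the 0xFFFFFF literal followed by an iadd of the original register.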

It seems that the CAL compiler is trying to do some optimization because the constant is 0x1000000, which is 1 << 24, and it hangs in the process. The udiv instructions in between (which have no simple translation to the final ISA code) probably don't help.
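
To make the guess concrete: a compiler that rewrites multiplications by powers of two as shifts would treat 0x1000000 and 0x1000001 differently. The sketch below only illustrates that kind of rewrite with made-up helper names. It says nothing about what the CAL compiler really does, and 0x100 and 0x10000 are powers of two as well, so this alone does not explain the hang.

```cpp
#include <cstdint>

// True when v has exactly one bit set.
constexpr bool isPowerOfTwo(std::uint32_t v)
{
    return v != 0u && (v & (v - 1u)) == 0u;
}

// Bit position of the single set bit; only meaningful for powers of two.
constexpr unsigned log2Exact(std::uint32_t v)
{
    return v <= 1u ? 0u : 1u + log2Exact(v >> 1);
}

// Multiply by a power-of-two constant using a shift instead of a multiply.
constexpr std::uint32_t mulViaShift(std::uint32_t x, std::uint32_t c)
{
    return x << log2Exact(c);
}

static_assert(isPowerOfTwo(0x1000000u), "1 << 24 qualifies for the rewrite");
static_assert(!isPowerOfTwo(0x1000001u), "0x1000001 does not");
static_assert(mulViaShift(7u, 0x1000000u) == 7u * 0x1000000u, "same result");
```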

Jawed · ‎03-21-2011

Are you using CAL++? If not, perhaps it's worth trying. I have never used it, but it might help you get around some gotchas.

As for the IL compiler being stupid, the only thing I can suggest as a short-term work-around is to use a constant buffer entry for the constant rather than a literal. That way the IL compiler can't do any optimisation. But that depends on your high level tool.
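
If the high-level tool builds its IL as text, the switch could be as small as the sketch below. emitUmul is a hypothetical helper, not part of CAL or of any existing tool, and the host would still have to write 0x1000000 into the chosen cb0 slot before running the kernel.

```cpp
#include <cassert>
#include <string>

// Hypothetical IL-generator helper: emits "umul dst, src, <operand>".
// useConstBuffer == false: the multiplier is a literal the compiler can see.
// useConstBuffer == true:  the multiplier is read from cb0 at run time,
//                          so the compiler cannot specialise on its value.
std::string emitUmul(const std::string& dst, const std::string& src,
                     unsigned index, char component, bool useConstBuffer)
{
    assert(component == 'x' || component == 'y' ||
           component == 'z' || component == 'w');
    const std::string swizzle(4, component);  // e.g. 'z' -> "zzzz"
    const std::string bank = useConstBuffer ? "cb0[" : "l[";
    return "umul " + dst + ", " + src + ", " + bank +
           std::to_string(index) + "]." + swizzle;
}
```

emitUmul("r4", "r6", 2, 'z', false) reproduces the problematic line, umul r4, r6, l[2].zzzz, while emitUmul("r4", "r6", 1, 'x', true) gives umul r4, r6, cb0[1].xxxx, which fits inside the dcl_cb cb0[2] declaration the kernel already has.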

It could be that two or more literals are interacting and the IL compiler is oscillating amongst them in some bizarre evaluation of what is "best".

llongeri · ‎03-22-2011

I haven't tried CAL++. I am using my own compiler because I wanted more control over the code generation and some tweaks, and it is just fun. But I'll try CAL++ and the other tools I have been finding around; they look interesting.

And yes, I am passing constants through the constant buffer. Normally I pass a zero in the constant buffer, add it to itself into a register, and then add that register to anything I don't want optimized. The CAL compiler does shuffle things around anyway, and normally it does a good job.
