13 Replies Latest reply on Nov 15, 2010 4:19 AM by himanshu.gautam

    clBuildProgram() hangs

    AntonZherzdev

      I'm having a problem compiling my CL code. The whole code is too big to post it here but my experiments narrowed it all down to a rather ridiculous situation.

      I have some pixels being processed in a global buffer (each pixel being an uint). Whenever I want to write a pixel to that buffer I call a function

       

      void Write(__global uint *where, uint value)

      {

        *where = value;

      }

       

      When I call clBuildProgram() it just never exits.

      Now, the funny part. If I replace this function's body with

      *where = 0;

      clBuildProgram() returns with success.

      The only possibility I can think about is that CL compiler builds some kind of data-dependency graph which it then processes incorrectly (falling into infinite loop or maybe just exponential time).

      Anyone have ideas?

       

      Additional info:

      1. My card is HD 4800. I'm using latest drivers and 2.2 Stream SDK.

      2. I can provide full code if this helps.

      3. I've also had various misterious crashes while investigating.

        • clBuildProgram() hangs
          AntonZherzdev

          My latest experiments show that this behaviour is exhibited with high probability in a situation where I have circular data-dependency in my code (global buffer being the target).

          That is, if I pass (__global uint *) to my kernel and then try to write into memory pointed by it something that was calculated using the same memory as source, compiler hangs (or crashes).

          Consider the following example (I haven't tried to compile it but it illustrates my idea in general). Note that any number of intermediate calculations can happen between reading and writing (including function calls etc.). It is data dependency that matters.

           

          __kernel void Kernel(__global uint *p) { uint a = *p + 1; *p = a; }

          • clBuildProgram() hangs
            Raistmer
            Do you make indexing into p array with thread ID? In your sample all threads will read and write just into the same location.
              • clBuildProgram() hangs
                himanshu.gautam

                anton,

                Kindly post the code you are working on.

                 

                • clBuildProgram() hangs
                  AntonZherzdev

                  Raistmer,

                  I'm targeting for only one thread now. Anyway I'm not able to actuallly run my program because compiler hangs.

                    • clBuildProgram() hangs
                      AntonZherzdev

                      Here's  the code I'm compiling. Sorry for the size. I wasn't able to come up with a smaller test case. There's a block of code there marked with a comment: "comment this block out to prevent compiler hang-up". Guess what? If you comment it out compiler won't hang. This code is trying to perform some calculations on the global buffer.

                      Also, here's the command line which clBuildProgram() uses to compile this code (ripped it out with Process Monitor):

                      clc --emit=llvmbc --opencl=1.0 -D__ATI_RV770__=1 -D__GPU__=1 -Dcl_khr_icd=1 -Dcl_amd_fp64=1 -Dcl_khr_gl_sharing=1 -Dcl_amd_device_attribute_query=1 -Dcl_amd_vec3=1 -Dcl_amd_printf=1

                      /////////////////////////////////////////////////////////////////////////////// //! @param data data in global memory /////////////////////////////////////////////////////////////////////////////// __kernel void YUV2RGB(__global /*__constant*/ uchar4* inputImage, __global uchar4* outputImage, unsigned int width, unsigned int height) { /*__constant*/ __global uchar4* pY = inputImage + get_global_id(0); /*__constant*/ __global uchar4* pU = inputImage + width * height / 4; /*__constant*/ __global uchar4* pV = inputImage + width * height * 5 / 16; unsigned int w4 = width / 4; unsigned int nShift = convert_int_sat(get_global_id(0) / 2); unsigned int nDiv = convert_int_sat(get_global_id(0) / w4); if((nDiv & 1) == 1) { nShift -= w4 * ((nDiv + 1) / 2) / 2; } else { nShift -= w4 * (nDiv / 2) / 2; } pU += nShift; pV += nShift; unsigned int pos = get_global_id(0) * 3; int4 YPixels = (int4)((*pY).x, (*pY).y, (*pY).z, (*pY).w); int4 UPixels = (int4)((*pU).x, (*pU).y, (*pU).z, (*pU).w); int4 VPixels = (int4)((*pV).x, (*pV).y, (*pV).z, (*pV).w); if((get_global_id(0) & 1) == 0) { //propagate U&V UPixels.w = UPixels.y; UPixels.z = UPixels.y; UPixels.y = UPixels.x; VPixels.w = VPixels.y; VPixels.z = VPixels.y; VPixels.y = VPixels.x; } else { //propagate U&V UPixels.x = UPixels.w; UPixels.y = UPixels.w; UPixels.w = UPixels.z; UPixels.x = UPixels.w; UPixels.y = UPixels.w; UPixels.w = UPixels.z; } int4 Ri = (1164*(YPixels - 16) + 1596*(VPixels - 128) + 500)/1000; int4 Gi = (1164*(YPixels - 16) - 813*(VPixels - 128) - 391*(UPixels - 128) + 500)/1000; int4 Bi = (1164*(YPixels - 16) + 2018*(UPixels - 128) + 500)/1000; uchar4 R = convert_uchar4_sat(Ri); uchar4 G = convert_uchar4_sat(Gi); uchar4 B = convert_uchar4_sat(Bi); outputImage[pos] = (uchar4)(R.x, G.x, B.x, R.y); outputImage[pos + 1] = (uchar4)(G.y, B.y, R.z, G.z); outputImage[pos + 2] = (uchar4)(B.z, R.w, G.w, B.w); } __kernel void TESTKRNL(unsigned int a, unsigned int b, __global unsigned int *c) { *c = a + b; } #ifndef DI_CLSTRUCT_H #define DI_CLSTRUCT_H #define MB_BUFFER_WIDTH 24 typedef struct _TCoeff { int m_iCoeffValue; int m_iLevelValue; int m_sPred; // short in structures is not supported by OpenCL } TCoeff; typedef struct _MbTransformCoeffs { TCoeff m_aaiLevel[24][16]; unsigned char m_aaucCoeffCount[24]; } MbTransformCoeffs; typedef struct _Transform // incomplete copy { int m_bClip; } Transform; typedef struct _YuvMbBuffer { // unsigned short m_aucYuvBuffer[MB_BUFFER_WIDTH * (29+1)]; unsigned int m_aucYuvBuffer[MB_BUFFER_WIDTH * (29+1)]; } YuvMbBuffer; typedef struct _IntraPrediction // incomplete copy { unsigned m_uiAvailableMaskMb; unsigned m_uiAvailable; } IntraPrediction; #endif // DI_CLSTRUCT_H // #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable #define BUFPTR __global //#define BUFPTR /* void WriteShort(__global ushort *where, uint value) { __global uint *aligned = (__global uint*)((uintptr_t)where & 0xFFFFFFFC); if ( (uintptr_t)where & 0x03 ) { *aligned = ((*aligned) & 0x0000FFFF) | (value << 16); } else { *aligned = ((*aligned) & 0xFFFF0000) | (value & 0xFFFF); } } */ void WriteShort(BUFPTR uint *where, uint value) { *where = value; } #define ROFVS( exp ) \ { \ if( !( exp ) ) \ { \ return; \ } \ } #define RNOK( exp ) \ { \ const int nMSysRetVal = ( exp ); \ if( nMSysRetVal != 0 ) \ { \ return nMSysRetVal; \ } \ } #define RNOKVS( exp ) \ { \ if( 0 != ( exp ) ) \ { \ return; \ } \ } #define AOF( exp ) (exp) int gClip( const int iX ) { const int i2 = (iX & 0xFF); if( i2 == iX ) { return iX; } if( iX < 0 ) { return 0x00; } else { return 0xFF; } } int B4x4IdxIsLegal(int idx) { return idx < 16; } void B4x4IdxInc(int *idx) { (*idx)++; } int CIdxIsLegal(int idx) { return idx < 8; } void CIdxInc(int *idx) { (*idx)++; } /****************************************************************************************************** * TCoeff ******************************************************************************************************/ int TCoeffToInt(__global TCoeff *this) { return this->m_iCoeffValue; } void TCoeffConstructor(__global TCoeff *this, int iVal) { this->m_iCoeffValue = iVal; this->m_iLevelValue = iVal; // this->m_sPred = 0; } void TCoeffIncBy(__global TCoeff *this, int iVal) { this->m_iCoeffValue += iVal; } /****************************************************************************************************** * MbTransformCoeffs ******************************************************************************************************/ __global TCoeff* MbTransformCoeffsGetLuma( __global MbTransformCoeffs *this, int cLumaIdx ) { return &this->m_aaiLevel[cLumaIdx][0]; } __global TCoeff* MbTransformCoeffsGetChroma( __global MbTransformCoeffs *this, int cChromaIdx ) { return &this->m_aaiLevel[16+cChromaIdx][0]; } void MbTransformCoeffsAdd( __global MbTransformCoeffs *this, __global MbTransformCoeffs* pcCoeffs, int bLuma, int bChroma ) { if( bLuma ) { for( int bIdx = 0; B4x4IdxIsLegal(bIdx); B4x4IdxInc(&bIdx) ) { __global TCoeff* piCoeff = MbTransformCoeffsGetLuma( this, bIdx ); __global TCoeff* piSrcCoeff = MbTransformCoeffsGetLuma( pcCoeffs, bIdx ); for( unsigned ui=0; ui<16; ui++ ) { TCoeffIncBy( &piCoeff[ui], TCoeffToInt( &piSrcCoeff[ui] ) ); } } } if( bChroma ) { for( int cIdx = 0; CIdxIsLegal(cIdx); CIdxInc(&cIdx) ) { __global TCoeff* piCoeff = MbTransformCoeffsGetChroma( this, cIdx ); __global TCoeff* piSrcCoeff = MbTransformCoeffsGetChroma( pcCoeffs, cIdx ); for( unsigned ui=0; ui<16; ui++ ) { TCoeffIncBy( &piCoeff[ui], TCoeffToInt( &piSrcCoeff[ui] ) ); } } } } /****************************************************************************************************** * Transform ******************************************************************************************************/ // Transform:: int xRound ( int i ) { return ((i)+(1<<5))>>6; } int TransformXClip ( __global Transform* this, int iPel ) { return ( this->m_bClip ? gClip( iPel ) : iPel); } void TransformInvTransformChromaDc( __global Transform* this, __global TCoeff* piCoeff ) { int tmp1, tmp2; int d00, d01, d10, d11; d00 = TCoeffToInt(&piCoeff[0]); d10 = TCoeffToInt(&piCoeff[32]); d01 = TCoeffToInt(&piCoeff[16]); d11 = TCoeffToInt(&piCoeff[48]); tmp1 = d00 + d11; tmp2 = d10 + d01; TCoeffConstructor( &piCoeff[0], ( tmp1 + tmp2 ) >> 5 ); TCoeffConstructor( &piCoeff[48], ( tmp1 - tmp2 ) >> 5 ); tmp1 = d00 - d11; tmp2 = d01 - d10; TCoeffConstructor( &piCoeff[32], ( tmp1 + tmp2 ) >> 5 ); TCoeffConstructor( &piCoeff[16], ( tmp1 - tmp2 ) >> 5 ); } void TransformXInvTransform4x4Blk( __global Transform* this, BUFPTR uint* puc, int iStride, __global TCoeff *piCoeff ) { int aai[4][4]; int tmp1, tmp2; int x, y; int iStride2 = 2*iStride; int iStride3 = 3*iStride; for( x = 0; x < 4; x++, piCoeff+=4 ) { tmp1 = TCoeffToInt( &piCoeff[0] ) + TCoeffToInt( &piCoeff[2] ); tmp2 = (TCoeffToInt( &piCoeff[3] )>>1) + TCoeffToInt( &piCoeff[1]); aai[0][x] = tmp1 + tmp2; aai[3][x] = tmp1 - tmp2; tmp1 = TCoeffToInt( &piCoeff[0] ) - TCoeffToInt( &piCoeff[2] ); tmp2 = (TCoeffToInt( &piCoeff[1] )>>1) - TCoeffToInt( &piCoeff[3] ); aai[1][x] = tmp1 + tmp2; aai[2][x] = tmp1 - tmp2; } for( y = 0; y < 4; y++, puc++ ) { tmp1 = aai[y][0] + aai[y][2]; tmp2 = (aai[y][3]>>1) + aai[y][1]; // comment this block out to prevent compiler hang-up { WriteShort( puc, TransformXClip( this, xRound( tmp1 + tmp2 ) + puc[0] ) ); WriteShort( puc + iStride3, TransformXClip( this, xRound( tmp1 - tmp2 ) + puc[iStride3] ) ); tmp1 = aai[y][0] - aai[y][2]; tmp2 = (aai[y][1]>>1) - aai[y][3]; WriteShort( puc + iStride, TransformXClip( this, xRound( tmp1 + tmp2) + puc[iStride] ) ); WriteShort( puc + iStride2, TransformXClip( this, xRound( tmp1 - tmp2) + puc[iStride2] ) ); } } } void TransformInvTransformChromaBlocks( __global Transform* this, BUFPTR uint* puc, int iStride, __global TCoeff* piCoeff ) { TransformXInvTransform4x4Blk( this, puc, iStride, piCoeff + 0x00 ); TransformXInvTransform4x4Blk( this, puc + 4, iStride, piCoeff + 0x10 ); puc += iStride << 2; TransformXInvTransform4x4Blk( this, puc, iStride, piCoeff + 0x20 ); TransformXInvTransform4x4Blk( this, puc + 4, iStride, piCoeff + 0x30 ); } /****************************************************************************************************** * YuvMbBuffer ******************************************************************************************************/ #define OFFSET 19 BUFPTR uint* YuvMbBufferGetMbCbAddr( BUFPTR YuvMbBuffer *this ) { return &this->m_aucYuvBuffer[OFFSET*MB_BUFFER_WIDTH + 4]; } BUFPTR uint* YuvMbBufferGetMbCrAddr( BUFPTR YuvMbBuffer *this ) { return &this->m_aucYuvBuffer[OFFSET*MB_BUFFER_WIDTH + 16]; } void YuvMbBufferLoadChroma( BUFPTR YuvMbBuffer *this, BUFPTR YuvMbBuffer *pcSrcBuffer ) { const int iStride = MB_BUFFER_WIDTH; BUFPTR uint* pDes = YuvMbBufferGetMbCbAddr(this); BUFPTR const uint* pSrc = YuvMbBufferGetMbCbAddr(pcSrcBuffer); int y, x; for( y = 0; y < 8; y++ ) { for( x = 0; x < 8; x++ ) { WriteShort( &pDes[x], pSrc[x] ); } pDes += iStride; pSrc += iStride; } pDes = YuvMbBufferGetMbCrAddr(this); pSrc = YuvMbBufferGetMbCrAddr(pcSrcBuffer); for( y = 0; y < 8; y++ ) { for( x = 0; x < 8; x++ ) { WriteShort( &pDes[x], pSrc[x] ); } pDes += iStride; pSrc += iStride; } } /****************************************************************************************************** * IntraPrediction ******************************************************************************************************/ // IntraPrediction:: unsigned xGetS0( BUFPTR uint* puc, int iStride ) { puc -= iStride; return as_uint(puc[0] + puc[1] + puc[2] + puc[3]); } // IntraPrediction:: unsigned xGetS1( BUFPTR uint* puc, int iStride ) { puc -= iStride; return as_uint(puc[4] + puc[5] + puc[6] + puc[7]); } // IntraPrediction:: unsigned xGetS2( BUFPTR uint* puc, int iStride ) { puc--; return as_uint(puc[0] + puc[iStride] + puc[2*iStride] + puc[3*iStride]); } // IntraPrediction:: unsigned xGetS3( BUFPTR uint* puc, int iStride ) { puc += 4 * iStride - 1; return as_uint(puc[0] + puc[iStride] + puc[2*iStride] + puc[3*iStride]); } int IntraPredictionXIsAboveRef( __global IntraPrediction *this ) { return ( this->m_uiAvailable & 0x2 ) == 0; } int IntraPredictionXIsLeftRef( __global IntraPrediction *this ) { return ( this->m_uiAvailable & 0x1 ) == 0; } int IntraPredictionXIsAllLeftAboveRef( __global IntraPrediction *this ) { return ( this->m_uiAvailable & 0x7 ) == 0; } // DC prediction void IntraPredictionXPred8x8IMode0DC( __global IntraPrediction *this, BUFPTR uint* puc, int iStride ) { unsigned uiA, uiB, uiC, uiD; if( ! IntraPredictionXIsAboveRef(this) ) { if( this->m_uiAvailableMaskMb & 0x1 ) // top { uiA = uiB = 0x80; } else { uiA = uiB = (xGetS2( puc, iStride ) + 2) / 4; } if( this->m_uiAvailableMaskMb & 0x10 ) // bot { uiC = uiD = 0x80; } else { uiC = uiD = (xGetS3( puc, iStride ) + 2) / 4; } } else { unsigned uiS0 = xGetS0( puc, iStride ); unsigned uiS1 = xGetS1( puc, iStride ); if( this->m_uiAvailableMaskMb & 0x1 ) // top { uiA = (uiS0 + 2)/4; uiB = (uiS1 + 2)/4; } else { unsigned uiS2 = xGetS2( puc, iStride ); uiA = (uiS0 + uiS2 + 4)/8; uiB = (uiS1 + 2)/4; } if( this->m_uiAvailableMaskMb & 0x10 ) // bot { uiC = (uiS0 + 2)/4; uiD = (uiS1 + 2)/4; } else { unsigned uiS3 = xGetS3( puc, iStride ); uiC = (uiS3 + 2)/4; uiD = (uiS1 + uiS3 + 4)/8; } } int pos; BUFPTR uint* pucDes = puc; for( pos = 0; pos < 4; pos++) { WriteShort( &pucDes[pos], uiA ); WriteShort( &pucDes[pos + 4], uiB ); } for( int n1 = 0; n1 < 3; n1 ++ ) { for (int i = 0; i < 8; ++i) { WriteShort( &pucDes[iStride + i], pucDes[i] ); } pucDes += iStride; } pucDes += iStride; for( pos = 0; pos < 4; pos++) { WriteShort( &pucDes[pos], uiC ); WriteShort( &pucDes[pos + 4], uiD ); } for( int n2 = 0; n2 < 3; n2 ++ ) { for (int i = 0; i < 8; ++i) { WriteShort( &pucDes[iStride + i], pucDes[i] ); } pucDes += iStride; } } // horizontal void IntraPredictionXPred8x8IMode1Hori( __global IntraPrediction *this, BUFPTR uint* puc, int iStride ) { AOF( IntraPredictionXIsLeftRef(this) ); for( int n = 0; n < 8; n++ ) { for( int m = 0; m < 8; m++ ) { WriteShort( &puc[m], puc[m-1] ); } puc += iStride; } } // vertical void IntraPredictionXPred8x8IMode2Vert( __global IntraPrediction *this, BUFPTR uint* puc, int iStride ) { AOF( IntraPredictionXIsAboveRef(this) ); for( int n = 0; n < 8; n++ ) { for( int i = 0; i < 8; i++ ) { WriteShort( &puc[i], *(puc - iStride + i) ); } puc += iStride; } } // plane prediction void IntraPredictionXPred8x8IMode3Plane( __global IntraPrediction *this, BUFPTR uint* puc, int iStride ) { int n, m; int iH = 0; int iV = 0; AOF( IntraPredictionXIsAllLeftAboveRef(this) ); BUFPTR uint* pucDes = puc; puc += 3 - iStride; for( n = 1; n < 5; n++ ) { iH += n * as_int(puc[n] - puc[-n]); } puc += (iStride << 2) - 4; for( m = iStride, n = 1; n < 5; n++, m += iStride) { iV += n * as_int(puc[m] - puc[-m]); } puc -= 3 * iStride - 1; int iB = (17 * iH + 16) >> 5; int iC = (17 * iV + 16) >> 5; int iA = 16 * as_int(puc[(iStride << 3) - (iStride + 1)] + puc[ 7 - iStride ]); int x, y; for( y = 0; y < 8; y++ ) { int iYSum = iA + (y-3) * iC + 16; for( x = 0; x < 8; x++ ) { WriteShort( &pucDes[x], gClip((iYSum + (x-3) * iB) >> 5) ); } pucDes += iStride; } } int IntraPredictionPredictChromaBlock( __global IntraPrediction *this, BUFPTR uint* pucCb, BUFPTR uint* pucCr, int iStride, unsigned uiPredMode ) { this->m_uiAvailable = ( this->m_uiAvailableMaskMb >> 4 ) | this->m_uiAvailableMaskMb; switch( uiPredMode ) { case 0: { IntraPredictionXPred8x8IMode0DC ( this, pucCb, iStride ); IntraPredictionXPred8x8IMode0DC ( this, pucCr, iStride ); break; } case 1: { IntraPredictionXPred8x8IMode1Hori ( this, pucCb, iStride ); IntraPredictionXPred8x8IMode1Hori ( this, pucCr, iStride ); break; } case 2: { IntraPredictionXPred8x8IMode2Vert ( this, pucCb, iStride ); IntraPredictionXPred8x8IMode2Vert ( this, pucCr, iStride ); break; } case 3: { IntraPredictionXPred8x8IMode3Plane ( this, pucCb, iStride ); IntraPredictionXPred8x8IMode3Plane ( this, pucCr, iStride ); break; } default: return -1; } return 0; } __kernel void xDecodeChroma( __global YuvMbBuffer* pcRecYuvBuffer, __global YuvMbBuffer* pcPredMbBuffer, __global MbTransformCoeffs* pcDecoderCoeffs, __global MbTransformCoeffs* pcMbCoeffs, __global IntraPrediction* pcIntraPrediction, __global Transform* pcTransform, int bPredChroma, uchar ucChromaPredMode, int bAddBaseCoeffsChroma ) { #ifdef SHARP_AVC_REWRITE_OUTPUT return; #endif BUFPTR uint* pucCb = YuvMbBufferGetMbCbAddr(pcRecYuvBuffer); BUFPTR uint* pucCr = YuvMbBufferGetMbCrAddr(pcRecYuvBuffer); int iStride = MB_BUFFER_WIDTH; if( bPredChroma ) { RNOKVS( IntraPredictionPredictChromaBlock( pcIntraPrediction, pucCb, pucCr, iStride, ucChromaPredMode ) ); YuvMbBufferLoadChroma( pcPredMbBuffer, pcRecYuvBuffer ); } if( bAddBaseCoeffsChroma ) { MbTransformCoeffsAdd( pcDecoderCoeffs, pcMbCoeffs, false, true ); } TCoeff aDC[8]; //=== store DC coeff === { for( int i = 0; i < 8; i++ ) { aDC[i] = MbTransformCoeffsGetChroma(pcDecoderCoeffs, i)[0]; } } TransformInvTransformChromaDc( pcTransform, MbTransformCoeffsGetChroma(pcDecoderCoeffs, 0) ); TransformInvTransformChromaDc( pcTransform, MbTransformCoeffsGetChroma(pcDecoderCoeffs, 4) ); TransformInvTransformChromaBlocks( pcTransform, pucCb, iStride, MbTransformCoeffsGetChroma(pcDecoderCoeffs, 0) ); TransformInvTransformChromaBlocks( pcTransform, pucCr, iStride, MbTransformCoeffsGetChroma(pcDecoderCoeffs, 4) ); //=== reset DC coeff === { for( int i = 0; i < 8; i++ ) { MbTransformCoeffsGetChroma(pcDecoderCoeffs, i)[0] = aDC[i]; } } }

                        • clBuildProgram() hangs
                          himanshu.gautam

                          Anton Zherzdev,

                          Are you able to run the opencl samples which came with SDK.

                          they are at   My documents\ATI .Please post the output of CLinfo sample.

                          Also try CAL samples.

                            • clBuildProgram() hangs
                              AntonZherzdev

                              Himanshu,

                              I haven't tried a lot of samples, just some. They do work. Even my code with those lines commented out works. It's the nature of that code itself that causes problems. If you'd like me to run a certain sample please let me know.

                              I've attached CLInfo output.

                              Number of platforms: 1 Platform Profile: FULL_PROFILE Platform Version: OpenCL 1.1 ATI-Stream-v2.2 (302) Platform Name: ATI Stream Platform Vendor: Advanced Micro Devices, Inc. Platform Extensions: cl_khr_icd cl_amd_event_callback Platform Name: ATI Stream Number of devices: 2 Device Type: CL_DEVICE_TYPE_CPU Device ID: 4098 Max compute units: 2 Max work items dimensions: 3 Max work items[0]: 1024 Max work items[1]: 1024 Max work items[2]: 1024 Max work group size: 1024 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 2999Mhz Address bits: 32 Max memory allocation: 536870912 Image support: No Max size of kernel argument: 4096 Alignment (bits) of base address: 1024 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: Yes Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: No Cache type: Read/Write Cache line size: 64 Cache size: 32768 Global memory size: 1073741824 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Global Local memory size: 32768 Profiling timer resolution: 0 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: Yes Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 00B3D40C Name: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz Vendor: GenuineIntel Driver version: 2.0 Profile: FULL_PROFILE Version: OpenCL 1.1 ATI-Stream-v2.2 (302) Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_printf Device Type: CL_DEVICE_TYPE_GPU Device ID: 4098 Max compute units: 10 Max work items dimensions: 3 Max work items[0]: 256 Max work items[1]: 256 Max work items[2]: 256 Max work group size: 256 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 750Mhz Address bits: 32 Max memory allocation: 134217728 Image support: No Max size of kernel argument: 1024 Alignment (bits) of base address: 32768 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 536870912 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Global Local memory size: 16384 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 00B3D40C Name: ATI RV770 Vendor: Advanced Micro Devices, Inc. Driver version: CAL 1.4.792 Profile: FULL_PROFILE Version: OpenCL 1.0 ATI-Stream-v2.2 (302) Extensions: cl_khr_icd cl_amd_fp64 cl_khr_gl_sharing cl_amd_device_attribute_query Passed!

                                • clBuildProgram() hangs
                                  himanshu.gautam

                                  Anton Zherzdev,

                                  i found this code in the kernel,which according to me is invalid,as we are not allowed to create __global variables in kernel.My code doesn't run even when the writeshort function is commented.Are you able to compile it completely in SKA.In my system SKA crashes and clBuildProgram hangs.

                                  #define BUFPTR __global

                                  BUFPTR uint* pucCb = YuvMbBufferGetMbCbAddr(pcRecYuvBuffer);
                                    BUFPTR uint* pucCr = YuvMbBufferGetMbCrAddr(pcRecYuvBuffer);

                                   

                                    • clBuildProgram() hangs
                                      AntonZherzdev

                                      Himanshu,

                                      I'll try that code in SKA as soon as I get to work. But I assure you that clBuildProgram() in my host application returns OK while that code block (a block of 6 lines containing 4 WriteShort calls) is commented out. To avoid possible misunderstanding here I post the same code WITH that piece commented out. Again, first version that  I've posted doesn't compile, this one DOES.

                                      I don't quite understand what's the problem with __global pointers. As far as I see I'm not trying to allocate anything in __global address space, just keeping a couple of pointers pointing at something there. I presume that is legal. Anyway I wonder how can I avoid that if I need to access complex data in __global address space.

                                      /////////////////////////////////////////////////////////////////////////////// //! @param data data in global memory /////////////////////////////////////////////////////////////////////////////// __kernel void YUV2RGB(__global /*__constant*/ uchar4* inputImage, __global uchar4* outputImage, unsigned int width, unsigned int height) { /*__constant*/ __global uchar4* pY = inputImage + get_global_id(0); /*__constant*/ __global uchar4* pU = inputImage + width * height / 4; /*__constant*/ __global uchar4* pV = inputImage + width * height * 5 / 16; unsigned int w4 = width / 4; unsigned int nShift = convert_int_sat(get_global_id(0) / 2); unsigned int nDiv = convert_int_sat(get_global_id(0) / w4); if((nDiv & 1) == 1) { nShift -= w4 * ((nDiv + 1) / 2) / 2; } else { nShift -= w4 * (nDiv / 2) / 2; } pU += nShift; pV += nShift; unsigned int pos = get_global_id(0) * 3; int4 YPixels = (int4)((*pY).x, (*pY).y, (*pY).z, (*pY).w); int4 UPixels = (int4)((*pU).x, (*pU).y, (*pU).z, (*pU).w); int4 VPixels = (int4)((*pV).x, (*pV).y, (*pV).z, (*pV).w); if((get_global_id(0) & 1) == 0) { //propagate U&V UPixels.w = UPixels.y; UPixels.z = UPixels.y; UPixels.y = UPixels.x; VPixels.w = VPixels.y; VPixels.z = VPixels.y; VPixels.y = VPixels.x; } else { //propagate U&V UPixels.x = UPixels.w; UPixels.y = UPixels.w; UPixels.w = UPixels.z; UPixels.x = UPixels.w; UPixels.y = UPixels.w; UPixels.w = UPixels.z; } int4 Ri = (1164*(YPixels - 16) + 1596*(VPixels - 128) + 500)/1000; int4 Gi = (1164*(YPixels - 16) - 813*(VPixels - 128) - 391*(UPixels - 128) + 500)/1000; int4 Bi = (1164*(YPixels - 16) + 2018*(UPixels - 128) + 500)/1000; uchar4 R = convert_uchar4_sat(Ri); uchar4 G = convert_uchar4_sat(Gi); uchar4 B = convert_uchar4_sat(Bi); outputImage[pos] = (uchar4)(R.x, G.x, B.x, R.y); outputImage[pos + 1] = (uchar4)(G.y, B.y, R.z, G.z); outputImage[pos + 2] = (uchar4)(B.z, R.w, G.w, B.w); } __kernel void TESTKRNL(unsigned int a, unsigned int b, __global unsigned int *c) { *c = a + b; } #ifndef DI_CLSTRUCT_H #define DI_CLSTRUCT_H #define MB_BUFFER_WIDTH 24 typedef struct _TCoeff { int m_iCoeffValue; int m_iLevelValue; int m_sPred; // short in structures is not supported by OpenCL } TCoeff; typedef struct _MbTransformCoeffs { TCoeff m_aaiLevel[24][16]; unsigned char m_aaucCoeffCount[24]; } MbTransformCoeffs; typedef struct _Transform // incomplete copy { int m_bClip; } Transform; typedef struct _YuvMbBuffer { // unsigned short m_aucYuvBuffer[MB_BUFFER_WIDTH * (29+1)]; unsigned int m_aucYuvBuffer[MB_BUFFER_WIDTH * (29+1)]; } YuvMbBuffer; typedef struct _IntraPrediction // incomplete copy { unsigned m_uiAvailableMaskMb; unsigned m_uiAvailable; } IntraPrediction; #endif // DI_CLSTRUCT_H // #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable #define BUFPTR __global //#define BUFPTR /* void WriteShort(__global ushort *where, uint value) { __global uint *aligned = (__global uint*)((uintptr_t)where & 0xFFFFFFFC); if ( (uintptr_t)where & 0x03 ) { *aligned = ((*aligned) & 0x0000FFFF) | (value << 16); } else { *aligned = ((*aligned) & 0xFFFF0000) | (value & 0xFFFF); } } */ void WriteShort(BUFPTR uint *where, uint value) { *where = value; } #define ROFVS( exp ) \ { \ if( !( exp ) ) \ { \ return; \ } \ } #define RNOK( exp ) \ { \ const int nMSysRetVal = ( exp ); \ if( nMSysRetVal != 0 ) \ { \ return nMSysRetVal; \ } \ } #define RNOKVS( exp ) \ { \ if( 0 != ( exp ) ) \ { \ return; \ } \ } #define AOF( exp ) (exp) int gClip( const int iX ) { const int i2 = (iX & 0xFF); if( i2 == iX ) { return iX; } if( iX < 0 ) { return 0x00; } else { return 0xFF; } } int B4x4IdxIsLegal(int idx) { return idx < 16; } void B4x4IdxInc(int *idx) { (*idx)++; } int CIdxIsLegal(int idx) { return idx < 8; } void CIdxInc(int *idx) { (*idx)++; } /****************************************************************************************************** * TCoeff ******************************************************************************************************/ int TCoeffToInt(__global TCoeff *this) { return this->m_iCoeffValue; } void TCoeffConstructor(__global TCoeff *this, int iVal) { this->m_iCoeffValue = iVal; this->m_iLevelValue = iVal; // this->m_sPred = 0; } void TCoeffIncBy(__global TCoeff *this, int iVal) { this->m_iCoeffValue += iVal; } /****************************************************************************************************** * MbTransformCoeffs ******************************************************************************************************/ __global TCoeff* MbTransformCoeffsGetLuma( __global MbTransformCoeffs *this, int cLumaIdx ) { return &this->m_aaiLevel[cLumaIdx][0]; } __global TCoeff* MbTransformCoeffsGetChroma( __global MbTransformCoeffs *this, int cChromaIdx ) { return &this->m_aaiLevel[16+cChromaIdx][0]; } void MbTransformCoeffsAdd( __global MbTransformCoeffs *this, __global MbTransformCoeffs* pcCoeffs, int bLuma, int bChroma ) { if( bLuma ) { for( int bIdx = 0; B4x4IdxIsLegal(bIdx); B4x4IdxInc(&bIdx) ) { __global TCoeff* piCoeff = MbTransformCoeffsGetLuma( this, bIdx ); __global TCoeff* piSrcCoeff = MbTransformCoeffsGetLuma( pcCoeffs, bIdx ); for( unsigned ui=0; ui<16; ui++ ) { TCoeffIncBy( &piCoeff[ui], TCoeffToInt( &piSrcCoeff[ui] ) ); } } } if( bChroma ) { for( int cIdx = 0; CIdxIsLegal(cIdx); CIdxInc(&cIdx) ) { __global TCoeff* piCoeff = MbTransformCoeffsGetChroma( this, cIdx ); __global TCoeff* piSrcCoeff = MbTransformCoeffsGetChroma( pcCoeffs, cIdx ); for( unsigned ui=0; ui<16; ui++ ) { TCoeffIncBy( &piCoeff[ui], TCoeffToInt( &piSrcCoeff[ui] ) ); } } } } /****************************************************************************************************** * Transform ******************************************************************************************************/ // Transform:: int xRound ( int i ) { return ((i)+(1<<5))>>6; } int TransformXClip ( __global Transform* this, int iPel ) { return ( this->m_bClip ? gClip( iPel ) : iPel); } void TransformInvTransformChromaDc( __global Transform* this, __global TCoeff* piCoeff ) { int tmp1, tmp2; int d00, d01, d10, d11; d00 = TCoeffToInt(&piCoeff[0]); d10 = TCoeffToInt(&piCoeff[32]); d01 = TCoeffToInt(&piCoeff[16]); d11 = TCoeffToInt(&piCoeff[48]); tmp1 = d00 + d11; tmp2 = d10 + d01; TCoeffConstructor( &piCoeff[0], ( tmp1 + tmp2 ) >> 5 ); TCoeffConstructor( &piCoeff[48], ( tmp1 - tmp2 ) >> 5 ); tmp1 = d00 - d11; tmp2 = d01 - d10; TCoeffConstructor( &piCoeff[32], ( tmp1 + tmp2 ) >> 5 ); TCoeffConstructor( &piCoeff[16], ( tmp1 - tmp2 ) >> 5 ); } void TransformXInvTransform4x4Blk( __global Transform* this, BUFPTR uint* puc, int iStride, __global TCoeff *piCoeff ) { int aai[4][4]; int tmp1, tmp2; int x, y; int iStride2 = 2*iStride; int iStride3 = 3*iStride; for( x = 0; x < 4; x++, piCoeff+=4 ) { tmp1 = TCoeffToInt( &piCoeff[0] ) + TCoeffToInt( &piCoeff[2] ); tmp2 = (TCoeffToInt( &piCoeff[3] )>>1) + TCoeffToInt( &piCoeff[1]); aai[0][x] = tmp1 + tmp2; aai[3][x] = tmp1 - tmp2; tmp1 = TCoeffToInt( &piCoeff[0] ) - TCoeffToInt( &piCoeff[2] ); tmp2 = (TCoeffToInt( &piCoeff[1] )>>1) - TCoeffToInt( &piCoeff[3] ); aai[1][x] = tmp1 + tmp2; aai[2][x] = tmp1 - tmp2; } for( y = 0; y < 4; y++, puc++ ) { tmp1 = aai[y][0] + aai[y][2]; tmp2 = (aai[y][3]>>1) + aai[y][1]; // comment this block out to prevent compiler hang-up /* { WriteShort( puc, TransformXClip( this, xRound( tmp1 + tmp2 ) + puc[0] ) ); WriteShort( puc + iStride3, TransformXClip( this, xRound( tmp1 - tmp2 ) + puc[iStride3] ) ); tmp1 = aai[y][0] - aai[y][2]; tmp2 = (aai[y][1]>>1) - aai[y][3]; WriteShort( puc + iStride, TransformXClip( this, xRound( tmp1 + tmp2) + puc[iStride] ) ); WriteShort( puc + iStride2, TransformXClip( this, xRound( tmp1 - tmp2) + puc[iStride2] ) ); } */ } } void TransformInvTransformChromaBlocks( __global Transform* this, BUFPTR uint* puc, int iStride, __global TCoeff* piCoeff ) { TransformXInvTransform4x4Blk( this, puc, iStride, piCoeff + 0x00 ); TransformXInvTransform4x4Blk( this, puc + 4, iStride, piCoeff + 0x10 ); puc += iStride << 2; TransformXInvTransform4x4Blk( this, puc, iStride, piCoeff + 0x20 ); TransformXInvTransform4x4Blk( this, puc + 4, iStride, piCoeff + 0x30 ); } /****************************************************************************************************** * YuvMbBuffer ******************************************************************************************************/ #define OFFSET 19 BUFPTR uint* YuvMbBufferGetMbCbAddr( BUFPTR YuvMbBuffer *this ) { return &this->m_aucYuvBuffer[OFFSET*MB_BUFFER_WIDTH + 4]; } BUFPTR uint* YuvMbBufferGetMbCrAddr( BUFPTR YuvMbBuffer *this ) { return &this->m_aucYuvBuffer[OFFSET*MB_BUFFER_WIDTH + 16]; } void YuvMbBufferLoadChroma( BUFPTR YuvMbBuffer *this, BUFPTR YuvMbBuffer *pcSrcBuffer ) { const int iStride = MB_BUFFER_WIDTH; BUFPTR uint* pDes = YuvMbBufferGetMbCbAddr(this); BUFPTR const uint* pSrc = YuvMbBufferGetMbCbAddr(pcSrcBuffer); int y, x; for( y = 0; y < 8; y++ ) { for( x = 0; x < 8; x++ ) { WriteShort( &pDes[x], pSrc[x] ); } pDes += iStride; pSrc += iStride; } pDes = YuvMbBufferGetMbCrAddr(this); pSrc = YuvMbBufferGetMbCrAddr(pcSrcBuffer); for( y = 0; y < 8; y++ ) { for( x = 0; x < 8; x++ ) { WriteShort( &pDes[x], pSrc[x] ); } pDes += iStride; pSrc += iStride; } } /****************************************************************************************************** * IntraPrediction ******************************************************************************************************/ // IntraPrediction:: unsigned xGetS0( BUFPTR uint* puc, int iStride ) { puc -= iStride; return as_uint(puc[0] + puc[1] + puc[2] + puc[3]); } // IntraPrediction:: unsigned xGetS1( BUFPTR uint* puc, int iStride ) { puc -= iStride; return as_uint(puc[4] + puc[5] + puc[6] + puc[7]); } // IntraPrediction:: unsigned xGetS2( BUFPTR uint* puc, int iStride ) { puc--; return as_uint(puc[0] + puc[iStride] + puc[2*iStride] + puc[3*iStride]); } // IntraPrediction:: unsigned xGetS3( BUFPTR uint* puc, int iStride ) { puc += 4 * iStride - 1; return as_uint(puc[0] + puc[iStride] + puc[2*iStride] + puc[3*iStride]); } int IntraPredictionXIsAboveRef( __global IntraPrediction *this ) { return ( this->m_uiAvailable & 0x2 ) == 0; } int IntraPredictionXIsLeftRef( __global IntraPrediction *this ) { return ( this->m_uiAvailable & 0x1 ) == 0; } int IntraPredictionXIsAllLeftAboveRef( __global IntraPrediction *this ) { return ( this->m_uiAvailable & 0x7 ) == 0; } // DC prediction void IntraPredictionXPred8x8IMode0DC( __global IntraPrediction *this, BUFPTR uint* puc, int iStride ) { unsigned uiA, uiB, uiC, uiD; if( ! IntraPredictionXIsAboveRef(this) ) { if( this->m_uiAvailableMaskMb & 0x1 ) // top { uiA = uiB = 0x80; } else { uiA = uiB = (xGetS2( puc, iStride ) + 2) / 4; } if( this->m_uiAvailableMaskMb & 0x10 ) // bot { uiC = uiD = 0x80; } else { uiC = uiD = (xGetS3( puc, iStride ) + 2) / 4; } } else { unsigned uiS0 = xGetS0( puc, iStride ); unsigned uiS1 = xGetS1( puc, iStride ); if( this->m_uiAvailableMaskMb & 0x1 ) // top { uiA = (uiS0 + 2)/4; uiB = (uiS1 + 2)/4; } else { unsigned uiS2 = xGetS2( puc, iStride ); uiA = (uiS0 + uiS2 + 4)/8; uiB = (uiS1 + 2)/4; } if( this->m_uiAvailableMaskMb & 0x10 ) // bot { uiC = (uiS0 + 2)/4; uiD = (uiS1 + 2)/4; } else { unsigned uiS3 = xGetS3( puc, iStride ); uiC = (uiS3 + 2)/4; uiD = (uiS1 + uiS3 + 4)/8; } } int pos; BUFPTR uint* pucDes = puc; for( pos = 0; pos < 4; pos++) { WriteShort( &pucDes[pos], uiA ); WriteShort( &pucDes[pos + 4], uiB ); } for( int n1 = 0; n1 < 3; n1 ++ ) { for (int i = 0; i < 8; ++i) { WriteShort( &pucDes[iStride + i], pucDes[i] ); } pucDes += iStride; } pucDes += iStride; for( pos = 0; pos < 4; pos++) { WriteShort( &pucDes[pos], uiC ); WriteShort( &pucDes[pos + 4], uiD ); } for( int n2 = 0; n2 < 3; n2 ++ ) { for (int i = 0; i < 8; ++i) { WriteShort( &pucDes[iStride + i], pucDes[i] ); } pucDes += iStride; } } // horizontal void IntraPredictionXPred8x8IMode1Hori( __global IntraPrediction *this, BUFPTR uint* puc, int iStride ) { AOF( IntraPredictionXIsLeftRef(this) ); for( int n = 0; n < 8; n++ ) { for( int m = 0; m < 8; m++ ) { WriteShort( &puc[m], puc[m-1] ); } puc += iStride; } } // vertical void IntraPredictionXPred8x8IMode2Vert( __global IntraPrediction *this, BUFPTR uint* puc, int iStride ) { AOF( IntraPredictionXIsAboveRef(this) ); for( int n = 0; n < 8; n++ ) { for( int i = 0; i < 8; i++ ) { WriteShort( &puc[i], *(puc - iStride + i) ); } puc += iStride; } } // plane prediction void IntraPredictionXPred8x8IMode3Plane( __global IntraPrediction *this, BUFPTR uint* puc, int iStride ) { int n, m; int iH = 0; int iV = 0; AOF( IntraPredictionXIsAllLeftAboveRef(this) ); BUFPTR uint* pucDes = puc; puc += 3 - iStride; for( n = 1; n < 5; n++ ) { iH += n * as_int(puc[n] - puc[-n]); } puc += (iStride << 2) - 4; for( m = iStride, n = 1; n < 5; n++, m += iStride) { iV += n * as_int(puc[m] - puc[-m]); } puc -= 3 * iStride - 1; int iB = (17 * iH + 16) >> 5; int iC = (17 * iV + 16) >> 5; int iA = 16 * as_int(puc[(iStride << 3) - (iStride + 1)] + puc[ 7 - iStride ]); int x, y; for( y = 0; y < 8; y++ ) { int iYSum = iA + (y-3) * iC + 16; for( x = 0; x < 8; x++ ) { WriteShort( &pucDes[x], gClip((iYSum + (x-3) * iB) >> 5) ); } pucDes += iStride; } } int IntraPredictionPredictChromaBlock( __global IntraPrediction *this, BUFPTR uint* pucCb, BUFPTR uint* pucCr, int iStride, unsigned uiPredMode ) { this->m_uiAvailable = ( this->m_uiAvailableMaskMb >> 4 ) | this->m_uiAvailableMaskMb; switch( uiPredMode ) { case 0: { IntraPredictionXPred8x8IMode0DC ( this, pucCb, iStride ); IntraPredictionXPred8x8IMode0DC ( this, pucCr, iStride ); break; } case 1: { IntraPredictionXPred8x8IMode1Hori ( this, pucCb, iStride ); IntraPredictionXPred8x8IMode1Hori ( this, pucCr, iStride ); break; } case 2: { IntraPredictionXPred8x8IMode2Vert ( this, pucCb, iStride ); IntraPredictionXPred8x8IMode2Vert ( this, pucCr, iStride ); break; } case 3: { IntraPredictionXPred8x8IMode3Plane ( this, pucCb, iStride ); IntraPredictionXPred8x8IMode3Plane ( this, pucCr, iStride ); break; } default: return -1; } return 0; } __kernel void xDecodeChroma( __global YuvMbBuffer* pcRecYuvBuffer, __global YuvMbBuffer* pcPredMbBuffer, __global MbTransformCoeffs* pcDecoderCoeffs, __global MbTransformCoeffs* pcMbCoeffs, __global IntraPrediction* pcIntraPrediction, __global Transform* pcTransform, int bPredChroma, uchar ucChromaPredMode, int bAddBaseCoeffsChroma ) { #ifdef SHARP_AVC_REWRITE_OUTPUT return; #endif BUFPTR uint* pucCb = YuvMbBufferGetMbCbAddr(pcRecYuvBuffer); BUFPTR uint* pucCr = YuvMbBufferGetMbCrAddr(pcRecYuvBuffer); int iStride = MB_BUFFER_WIDTH; if( bPredChroma ) { RNOKVS( IntraPredictionPredictChromaBlock( pcIntraPrediction, pucCb, pucCr, iStride, ucChromaPredMode ) ); YuvMbBufferLoadChroma( pcPredMbBuffer, pcRecYuvBuffer ); } if( bAddBaseCoeffsChroma ) { MbTransformCoeffsAdd( pcDecoderCoeffs, pcMbCoeffs, false, true ); } TCoeff aDC[8]; //=== store DC coeff === { for( int i = 0; i < 8; i++ ) { aDC[i] = MbTransformCoeffsGetChroma(pcDecoderCoeffs, i)[0]; } } TransformInvTransformChromaDc( pcTransform, MbTransformCoeffsGetChroma(pcDecoderCoeffs, 0) ); TransformInvTransformChromaDc( pcTransform, MbTransformCoeffsGetChroma(pcDecoderCoeffs, 4) ); TransformInvTransformChromaBlocks( pcTransform, pucCb, iStride, MbTransformCoeffsGetChroma(pcDecoderCoeffs, 0) ); TransformInvTransformChromaBlocks( pcTransform, pucCr, iStride, MbTransformCoeffsGetChroma(pcDecoderCoeffs, 4) ); //=== reset DC coeff === { for( int i = 0; i < 8; i++ ) { MbTransformCoeffsGetChroma(pcDecoderCoeffs, i)[0] = aDC[i]; } } }

                          • clBuildProgram() hangs
                            Raistmer
                            having __global pointers is legal. W/o it referencing to buffers parts would be very clumsy.
                            For example I use
                            __global float4 *two = tab + tmp0 * fft_len4;
                            inside kernel and it works OK and as expected.
                            And moreover, w/o __global modifier it will not work cause __global, __constant and __private address ranges are different ones. If kernel argument uses __global pointer all its derivatives should be __global too.