Archives Discussions

Archives Discussions

confirm bugs in cl_amd_printf

mkohler · ‎06-20-2014

Hello OpenCLers,

I'm seeing two problems with cl_amd_printf. These problems only occur when running on the GPU.

If the input string doesn't include a % format specifier, the input string doesn't get printed.
If the built kernel is retrieved from cache, a Segmentation fault occurs.

The details of my system are:

- Ubuntu 14.04 LTS, x86_64

- processor: AMD A10 7850K

- Catalyst 4.4.12874

- clinfo version: OpenCL 1.2 AMD-APP (1445.5)

- running on the GPU

I've pasted code to demonstrate these problems below. On my system, the first time the code is run it will print the "lucky number" line. The second time the code is run, I see the message that "Built kernel retrieved from cache" and a segmentation fault.

Mark

---------------------------------------------------------------------

import sys

import pyopencl as cl

KERNEL_STR = r'''

#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void cl_kernel() {

// The line below does not get printed.

printf("Hello, world\n");

// This line will get printed.

printf("My lucky number is %d\n", 7);

// The line below does not get printed.

printf("Hello again.\n");

}

'''

def main():

platform = cl.get_platforms()[0]

devices = platform.get_devices(cl.device_type.GPU)

ctx = cl.Context(devices)

device = devices[0]

cmd_q = cl.CommandQueue(ctx, device)

program = cl.Program(ctx, KERNEL_STR)

program.build(devices=[device])

program.cl_kernel(cmd_q, (4,), None)

if __name__ == '__main__':

sys.exit(main())

dipak · ‎06-23-2014

Hi,

Thanks for reporting. We'll try to reproduce the error and get back to you.

Regards,

dipak · ‎06-26-2014

Hi,

I am able to reproduce the printf error (but under different system setup and driver). I'll do some more test and, if required, I'll file an internal bug report.

Regards,

mkohler · ‎06-26-2014

You were able to reproduce both the missing output problem, and the segmentation fault?

Mark

dipak · ‎06-27-2014

Hi Mark,

I was able to reproduce the missing output problem, but not the segmentation fault.

[FYI: I used VS2012 and AMD HD6970 using a different version of driver (greater than your, as per "clinfo").]

Regards,

Raistmer · ‎09-27-2016

I can confirm segmentation fault too:

- Unhandled Exception Record -

Reason: Access Violation (0xc0000005) at address 0x54DCC4C9 read attempt to address 0x00000010

- Registers -

eax=00000000 ebx=00000000 ecx=0676fd2c edx=0666e8b8 esi=00000000 edi=006555a8

eip=54dcc4c9 esp=0676fcd0 ebp=0676fcdc

cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010246

- Callstack -

ChildEBP RetAddr Args to Child

0676fcdc 54e35b80 00000000 00000000 ffffffff 00000002 amdocl!oclGetAsic+0x0

0676fd88 54e35258 00000000 03313000 0676fdc0 0846c1c8 amdocl!clSetKernelExecInfo+0x0

0676fdd0 54e19db2 05043ba8 003f4b01 00000000 04fde860 amdocl!clSetKernelExecInfo+0x0

0676fe88 54e19993 0846c1b4 00000000 00000000 00000000 amdocl!clSetKernelExecInfo+0x0

0676febc 54ded396 0846c0f0 04dcb608 04dcb588 04dcb608 amdocl!clSetKernelExecInfo+0x0

0676ff14 54ded456 05043ba8 04dcb608 00658a08 0676ff4c amdocl!clSetKernelExecInfo+0x0

0676ff34 54de91cd 04dcb500 00000000 00000000 00658a08 amdocl!clSetKernelExecInfo+0x0

0676ff58 54dfe2fc 00000000 00000000 04dcb608 00000000 amdocl!clSetKernelExecInfo+0x0

0676ff88 7561338a 04dcb608 0676ffd4 779d9f72 04dcb608 amdocl!clSetKernelExecInfo+0x0

0676ff94 779d9f72 04dcb608 4d5e5df8 00000000 00000000 kernel32!BaseThreadInitThunk+0x0

0676ffd4 779d9f45 54dfe2c0 04dcb608 00000000 00000000 ntdll!RtlInitializeExceptionChain+0x0

0676ffec 00000000 54dfe2c0 04dcb608 00000000 00000006 ntdll!RtlInitializeExceptionChain+0x0

The reason: kernel was read from binary cache and it contained: printf("idx=%d\n",idx);

With this line commented out crash doesn't happen. With this line present but kernel built from source- crash doesn't happen.

Driver/device:

dipak · ‎09-28-2016

Thanks for reporting. Could you please share your setup details where the segfault was observed?

Please share your kernel code too if it's different than the above one.

Regards,

Raistmer · ‎09-28-2016

Besides already listed: it's windows 7 x64 host.

Kernel is:

__attribute__((reqd_work_group_size(RESULT_SIZE, 2, 1)))

__kernel void Gaussian_logging_kernel_twin(__global const float* restrict PoT, __global const float* restrict NormMaxPower,

int ul_FftLength,

__constant ocl_GaussFit_t* restrict settings, __global uint* restrict result_flag,

__global float4* restrict GaussFitResults,__global GPUState* restrict gpu_state,

__global const float* restrict GaussFitScores){

__local Gaussian_core signals[RESULT_SIZE][2][MAX_GPU_GAUSSIANS+1];

__local int res_size[RESULT_SIZE][2];

__local float best_score[RESULT_SIZE][2];

const int tid=get_global_id(0);

const int neg=get_global_id(1);

int bfft_ind;

int bbin;

float bscore=-12.f;

int idx=gpu_state->gaussians.index;

int local_idx=0;

int start=settings->GaussTOffsetStart;

int stop=settings->GaussTOffsetStop;

int stride=(RESULT_SIZE>ul_FftLength?1:ul_FftLength/RESULT_SIZE);

if(result_flag[tid+RESULT_SIZE*neg]==1){

result_flag[tid+RESULT_SIZE*neg]=0;

bscore=gpu_state->gaussians.gaussian[0].score;

//max(gpu_state->gaussians.bscore,gpu_state->gaussians.gaussian[0].score);//R take "best"

for(int j=max(tid*stride,1);j<(tid+1)*stride;j++)//R: DC PoT always skipped in Gaussian search

for(int k=start;k<stop;k++){

if(idx+local_idx>=MAX_GPU_GAUSSIANS)break;//R:overflow

int coord=ul_FftLength*k+j+ul_FftLength*GAUSS_POT_LENGTH*neg;

float4 tmp=GaussFitResults[coord];

if(tmp.x==0.0f)continue;//R: nothing here

float score = (idx?-12.f:GaussFitScores[coord]);

//R: best update

if( (score > bscore)&&(tmp.z<=settings->gauss_chi_sq_thresh) ){

bscore=score;

bfft_ind=k;

bbin=j;

}

//R: new gaussian logging locally

if( (tmp.z<=settings->GaussChiSqThresh) && (tmp.w >= settings->gauss_null_chi_sq_thresh) &&

(tmp.y>=tmp.x*settings->GaussPeakPowerThresh)){

local_idx++;

signals[tid][neg][local_idx].fft_ind=k;

signals[tid][neg][local_idx].bin=j;

}

//R: best gaussian update in global memory

signals[tid][neg][0].fft_ind=bfft_ind;

signals[tid][neg][0].bin=bbin;

}

//R: these 2 updated always, even if nothing in current chunk

best_score[tid][neg]=bscore;

res_size[tid][neg]=local_idx;

barrier(CLK_LOCAL_MEM_FENCE);

//R: now complete reduction from local memory

if(tid+neg == 0){//R: single first workitem does the reduction job

bscore=gpu_state->gaussians.gaussian[0].score;

for(int i=0;i<2;i++){for(int j=0;j<RESULT_SIZE;j++){

//R:best update

if(best_score>bscore){

bscore=best_score;

bfft_ind=signals[0].fft_ind;

bbin=signals[0].bin;

}

//R: new gaussian logging

for(int k=1;k<res_size;k++){

if(idx>=MAX_GPU_GAUSSIANS){//R: update global mem data and return on overflow

gpu_state->gaussians.index=idx;

if(bscore>gpu_state->gaussians.gaussian[0].score){//R: best gaussian update in global memory

float4 tmp=GaussFitResults[ul_FftLength*bfft_ind+bbin+ul_FftLength*GAUSS_POT_LENGTH*i];

gpu_state->gaussians.gaussian[0].score=bscore;

gpu_state->gaussians.gaussian[0].icfft=(i?-gpu_state->icfft:gpu_state->icfft);

gpu_state->gaussians.gaussian[0].fft_ind=bfft_ind;

gpu_state->gaussians.gaussian[0].bin=bbin;

gpu_state->gaussians.gaussian[0].peak_power=tmp.y;

gpu_state->gaussians.gaussian[0].mean_power=tmp.x;

gpu_state->gaussians.gaussian[0].chisqr=tmp.z;

gpu_state->gaussians.gaussian[0].null_chisqr=tmp.w;

gpu_state->gaussians.gaussian[0].max_power=NormMaxPower[bbin+i*ul_FftLength];

for(int m=0;m<GAUSSIAN_POT_LENGTH;m++)

gpu_state->gaussians.gaussian[0].pot=PoT[m*ul_FftLength+bbin+i*ul_FftLength*GAUSS_POT_LENGTH];

}

return;

}

idx++;

gpu_state->gaussians.gaussian[idx].icfft=(i?-gpu_state->icfft:gpu_state->icfft);

gpu_state->gaussians.gaussian[idx].fft_ind=signals.fft_ind;

gpu_state->gaussians.gaussian[idx].bin=signals.bin;

float4 tmp=

GaussFitResults[ul_FftLength*signals.fft_ind+signals.bin+ul_FftLength*GAUSS_POT_LENGTH*i];

gpu_state->gaussians.gaussian[idx].peak_power=tmp.y;

gpu_state->gaussians.gaussian[idx].mean_power=tmp.x;

gpu_state->gaussians.gaussian[idx].chisqr=tmp.z;

gpu_state->gaussians.gaussian[idx].null_chisqr=tmp.w;

gpu_state->gaussians.gaussian[idx].max_power=NormMaxPower[j+i*ul_FftLength];

//R: TODO evaluate async workgroup copy

for(int m=0;m<GAUSSIAN_POT_LENGTH;m++)

gpu_state->gaussians.gaussian[idx].pot=PoT[m*ul_FftLength+signals.bin+i*ul_FftLength*GAUSS_POT_LENGTH];

}

if(bscore>gpu_state->gaussians.gaussian[0].score){//R: best gaussian update in global memory

gpu_state->gaussians.gaussian[0].score=bscore;

gpu_state->gaussians.gaussian[0].icfft=(i?-gpu_state->icfft:gpu_state->icfft);

gpu_state->gaussians.gaussian[0].fft_ind=bfft_ind;

gpu_state->gaussians.gaussian[0].bin=bbin;

float4 tmp=GaussFitResults[ul_FftLength*bfft_ind+bbin+ul_FftLength*GAUSS_POT_LENGTH*i];

gpu_state->gaussians.gaussian[0].peak_power=tmp.y;

gpu_state->gaussians.gaussian[0].mean_power=tmp.x;

gpu_state->gaussians.gaussian[0].chisqr=tmp.z;

gpu_state->gaussians.gaussian[0].null_chisqr=tmp.w;

gpu_state->gaussians.gaussian[0].max_power=NormMaxPower[bbin+i*ul_FftLength];

for(int m=0;m<GAUSSIAN_POT_LENGTH;m++)

gpu_state->gaussians.gaussian[0].pot=PoT[m*ul_FftLength+bbin+i*ul_FftLength*GAUSS_POT_LENGTH];

}

gpu_state->gaussians.index=idx;

//printf("idx=%d\n",idx);

}//end common work

}

dipak · ‎09-29-2016

Hi,

AFAIK, "Loveland" belongs to TeraScale series. Here is the latest driver available for that series: Legacy. From your clinfo output, it seems that your driver is older than that. Did you check with this latest driver?

Btw, TeraScale or any other pre-GCN products have been moved to a legacy support model and no additional driver releases are planned. Please check the release note for details.

Regards,

alexfd7 · ‎09-17-2016

Here on my computer, I do so and it works:

// The line below does not get printed.

printf("Hello, world%s\n"," ");

// The line below does not get printed.

printf("Hello again.%s\n", " ");

And a line break like this: printf("%s\n", " ");

Advanced Micro Devices, Inc.

Driver version:

OpenCL 1.2 AMD-APP (1642.5)