Hello OpenCLers,
I'm seeing two problems with cl_amd_printf. These problems only occur when running on the GPU.
The details of my system are:
- Ubuntu 14.04 LTS, x86_64
- processor: AMD A10 7850K
- Catalyst 4.4.12874
- clinfo version: OpenCL 1.2 AMD-APP (1445.5)
- running on the GPU
I've pasted code to demonstrate these problems below. On my system, the first time the code is run it will print the "lucky number" line. The second time the code is run, I see the message that "Built kernel retrieved from cache" and a segmentation fault.
Mark
---------------------------------------------------------------------
import sys
import pyopencl as cl
KERNEL_STR = r'''
#pragma OPENCL EXTENSION cl_amd_printf : enable
__kernel void cl_kernel() {
// The line below does not get printed.
printf("Hello, world\n");
// This line will get printed.
printf("My lucky number is %d\n", 7);
// The line below does not get printed.
printf("Hello again.\n");
}
'''
def main():
platform = cl.get_platforms()[0]
devices = platform.get_devices(cl.device_type.GPU)
ctx = cl.Context(devices)
device = devices[0]
cmd_q = cl.CommandQueue(ctx, device)
program = cl.Program(ctx, KERNEL_STR)
program.build(devices=[device])
program.cl_kernel(cmd_q, (4,), None)
if __name__ == '__main__':
sys.exit(main())
Hi,
Thanks for reporting. We'll try to reproduce the error and get back to you.
Regards,
Hi,
I am able to reproduce the printf error (but under different system setup and driver). I'll do some more test and, if required, I'll file an internal bug report.
Regards,
You were able to reproduce both the missing output problem, and the segmentation fault?
Mark
Hi Mark,
I was able to reproduce the missing output problem, but not the segmentation fault.
[FYI: I used VS2012 and AMD HD6970 using a different version of driver (greater than your, as per "clinfo").]
Regards,
I can confirm segmentation fault too:
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x54DCC4C9 read attempt to address 0x00000010
- Registers -
eax=00000000 ebx=00000000 ecx=0676fd2c edx=0666e8b8 esi=00000000 edi=006555a8
eip=54dcc4c9 esp=0676fcd0 ebp=0676fcdc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010246
- Callstack -
ChildEBP RetAddr Args to Child
0676fcdc 54e35b80 00000000 00000000 ffffffff 00000002 amdocl!oclGetAsic+0x0
0676fd88 54e35258 00000000 03313000 0676fdc0 0846c1c8 amdocl!clSetKernelExecInfo+0x0
0676fdd0 54e19db2 05043ba8 003f4b01 00000000 04fde860 amdocl!clSetKernelExecInfo+0x0
0676fe88 54e19993 0846c1b4 00000000 00000000 00000000 amdocl!clSetKernelExecInfo+0x0
0676febc 54ded396 0846c0f0 04dcb608 04dcb588 04dcb608 amdocl!clSetKernelExecInfo+0x0
0676ff14 54ded456 05043ba8 04dcb608 00658a08 0676ff4c amdocl!clSetKernelExecInfo+0x0
0676ff34 54de91cd 04dcb500 00000000 00000000 00658a08 amdocl!clSetKernelExecInfo+0x0
0676ff58 54dfe2fc 00000000 00000000 04dcb608 00000000 amdocl!clSetKernelExecInfo+0x0
0676ff88 7561338a 04dcb608 0676ffd4 779d9f72 04dcb608 amdocl!clSetKernelExecInfo+0x0
0676ff94 779d9f72 04dcb608 4d5e5df8 00000000 00000000 kernel32!BaseThreadInitThunk+0x0
0676ffd4 779d9f45 54dfe2c0 04dcb608 00000000 00000000 ntdll!RtlInitializeExceptionChain+0x0
0676ffec 00000000 54dfe2c0 04dcb608 00000000 00000006 ntdll!RtlInitializeExceptionChain+0x0
The reason: kernel was read from binary cache and it contained: printf("idx=%d\n",idx);
With this line commented out crash doesn't happen. With this line present but kernel built from source- crash doesn't happen.
Driver/device:
Name: | Loveland | |||||
Vendor: | Advanced Micro Devices, Inc. | |||||
Driver version: | 1642.5 (VM) | |||||
Version: | OpenCL 1.2 AMD-APP (1642.5) |
Thanks for reporting. Could you please share your setup details where the segfault was observed?
Please share your kernel code too if it's different than the above one.
Regards,
Besides already listed: it's windows 7 x64 host.
Kernel is:
__attribute__((reqd_work_group_size(RESULT_SIZE, 2, 1)))
__kernel void Gaussian_logging_kernel_twin(__global const float* restrict PoT, __global const float* restrict NormMaxPower,
int ul_FftLength,
__constant ocl_GaussFit_t* restrict settings, __global uint* restrict result_flag,
__global float4* restrict GaussFitResults,__global GPUState* restrict gpu_state,
__global const float* restrict GaussFitScores){
__local Gaussian_core signals[RESULT_SIZE][2][MAX_GPU_GAUSSIANS+1];
__local int res_size[RESULT_SIZE][2];
__local float best_score[RESULT_SIZE][2];
const int tid=get_global_id(0);
const int neg=get_global_id(1);
int bfft_ind;
int bbin;
float bscore=-12.f;
int idx=gpu_state->gaussians.index;
int local_idx=0;
int start=settings->GaussTOffsetStart;
int stop=settings->GaussTOffsetStop;
int stride=(RESULT_SIZE>ul_FftLength?1:ul_FftLength/RESULT_SIZE);
if(result_flag[tid+RESULT_SIZE*neg]==1){
result_flag[tid+RESULT_SIZE*neg]=0;
bscore=gpu_state->gaussians.gaussian[0].score;
//max(gpu_state->gaussians.bscore,gpu_state->gaussians.gaussian[0].score);//R take "best"
for(int j=max(tid*stride,1);j<(tid+1)*stride;j++)//R: DC PoT always skipped in Gaussian search
for(int k=start;k<stop;k++){
if(idx+local_idx>=MAX_GPU_GAUSSIANS)break;//R:overflow
int coord=ul_FftLength*k+j+ul_FftLength*GAUSS_POT_LENGTH*neg;
float4 tmp=GaussFitResults[coord];
if(tmp.x==0.0f)continue;//R: nothing here
float score = (idx?-12.f:GaussFitScores[coord]);
//R: best update
if( (score > bscore)&&(tmp.z<=settings->gauss_chi_sq_thresh) ){
bscore=score;
bfft_ind=k;
bbin=j;
}
//R: new gaussian logging locally
if( (tmp.z<=settings->GaussChiSqThresh) && (tmp.w >= settings->gauss_null_chi_sq_thresh) &&
(tmp.y>=tmp.x*settings->GaussPeakPowerThresh)){
local_idx++;
signals[tid][neg][local_idx].fft_ind=k;
signals[tid][neg][local_idx].bin=j;
}
}
//R: best gaussian update in global memory
signals[tid][neg][0].fft_ind=bfft_ind;
signals[tid][neg][0].bin=bbin;
}
//R: these 2 updated always, even if nothing in current chunk
best_score[tid][neg]=bscore;
res_size[tid][neg]=local_idx;
barrier(CLK_LOCAL_MEM_FENCE);
//R: now complete reduction from local memory
if(tid+neg == 0){//R: single first workitem does the reduction job
bscore=gpu_state->gaussians.gaussian[0].score;
for(int i=0;i<2;i++){for(int j=0;j<RESULT_SIZE;j++){
//R:best update
if(best_score
bscore=best_score
bfft_ind=signals
bbin=signals
}
//R: new gaussian logging
for(int k=1;k<res_size
if(idx>=MAX_GPU_GAUSSIANS){//R: update global mem data and return on overflow
gpu_state->gaussians.index=idx;
if(bscore>gpu_state->gaussians.gaussian[0].score){//R: best gaussian update in global memory
float4 tmp=GaussFitResults[ul_FftLength*bfft_ind+bbin+ul_FftLength*GAUSS_POT_LENGTH*i];
gpu_state->gaussians.gaussian[0].score=bscore;
gpu_state->gaussians.gaussian[0].icfft=(i?-gpu_state->icfft:gpu_state->icfft);
gpu_state->gaussians.gaussian[0].fft_ind=bfft_ind;
gpu_state->gaussians.gaussian[0].bin=bbin;
gpu_state->gaussians.gaussian[0].peak_power=tmp.y;
gpu_state->gaussians.gaussian[0].mean_power=tmp.x;
gpu_state->gaussians.gaussian[0].chisqr=tmp.z;
gpu_state->gaussians.gaussian[0].null_chisqr=tmp.w;
gpu_state->gaussians.gaussian[0].max_power=NormMaxPower[bbin+i*ul_FftLength];
for(int m=0;m<GAUSSIAN_POT_LENGTH;m++)
gpu_state->gaussians.gaussian[0].pot
}
return;
}
idx++;
gpu_state->gaussians.gaussian[idx].icfft=(i?-gpu_state->icfft:gpu_state->icfft);
gpu_state->gaussians.gaussian[idx].fft_ind=signals
gpu_state->gaussians.gaussian[idx].bin=signals
float4 tmp=
GaussFitResults[ul_FftLength*signals
gpu_state->gaussians.gaussian[idx].peak_power=tmp.y;
gpu_state->gaussians.gaussian[idx].mean_power=tmp.x;
gpu_state->gaussians.gaussian[idx].chisqr=tmp.z;
gpu_state->gaussians.gaussian[idx].null_chisqr=tmp.w;
gpu_state->gaussians.gaussian[idx].max_power=NormMaxPower[j+i*ul_FftLength];
//R: TODO evaluate async workgroup copy
for(int m=0;m<GAUSSIAN_POT_LENGTH;m++)
gpu_state->gaussians.gaussian[idx].pot
}
}
if(bscore>gpu_state->gaussians.gaussian[0].score){//R: best gaussian update in global memory
gpu_state->gaussians.gaussian[0].score=bscore;
gpu_state->gaussians.gaussian[0].icfft=(i?-gpu_state->icfft:gpu_state->icfft);
gpu_state->gaussians.gaussian[0].fft_ind=bfft_ind;
gpu_state->gaussians.gaussian[0].bin=bbin;
float4 tmp=GaussFitResults[ul_FftLength*bfft_ind+bbin+ul_FftLength*GAUSS_POT_LENGTH*i];
gpu_state->gaussians.gaussian[0].peak_power=tmp.y;
gpu_state->gaussians.gaussian[0].mean_power=tmp.x;
gpu_state->gaussians.gaussian[0].chisqr=tmp.z;
gpu_state->gaussians.gaussian[0].null_chisqr=tmp.w;
gpu_state->gaussians.gaussian[0].max_power=NormMaxPower[bbin+i*ul_FftLength];
for(int m=0;m<GAUSSIAN_POT_LENGTH;m++)
gpu_state->gaussians.gaussian[0].pot
}
}
gpu_state->gaussians.index=idx;
//printf("idx=%d\n",idx);
}//end common work
}
Hi,
AFAIK, "Loveland" belongs to TeraScale series. Here is the latest driver available for that series: Legacy. From your clinfo output, it seems that your driver is older than that. Did you check with this latest driver?
Btw, TeraScale or any other pre-GCN products have been moved to a legacy support model and no additional driver releases are planned. Please check the release note for details.
Regards,
Here on my computer, I do so and it works:
// The line below does not get printed.
printf("Hello, world%s\n"," ");
// The line below does not get printed.
printf("Hello again.%s\n", " ");
And a line break like this: printf("%s\n", " ");