
Archives Discussions

notyou
Adept III

Error unless printing, after passing a certain data size???

I realize this sounds incredibly weird, but for some reason, once I pass a certain input size (N = 2561 for this particular example), my returned results never come out correct unless I print the value while executing the kernel.

In the attached code (a binomial lattice, if it helps), if I execute with i = 2561 (the number of steps), the buffer value returned is 0.15... and is exactly the value I expect. If I step up to 2562, I get a very large number in the thousands which slowly increases. What I don't understand is that if I then enable the printf in the kernel, the same buffer I retrieve (with absolutely no changes to my main.cpp file) will have the correct value, 0.15..., though it takes an incredibly long time to execute because of the printf. If it helps, I am also running clFinish() after each iteration to force it to update (for testing).
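
The host-side stepping loop is roughly the following (a simplified sketch only; the helper and buffer names here are placeholders, not my exact main.cpp code, and the constant kernel arguments are set once elsewhere):

#include <CL/cl.h>

/* Simplified sketch of the per-step loop (placeholder names). Each step
   launches LookbackOpenCL over the i live nodes, blocks with clFinish(),
   then ping-pongs the two value buffers so the output feeds the next step.
   Args 0 and 3-8 (the __local buffer and the float constants) and arg 10
   (result) are assumed to be set once before the loop. */
void RunLattice(cl_command_queue queue, cl_kernel kernel,
                cl_mem optionValues, cl_mem tempOptionValues, int N)
{
    const size_t local = 256;

    for (int i = N; i >= 1; --i)
    {
        size_t global = (((size_t)i + local - 1) / local) * local; /* round up to 256 */

        clSetKernelArg(kernel, 1, sizeof(cl_mem), &optionValues);
        clSetKernelArg(kernel, 2, sizeof(cl_mem), &tempOptionValues);
        clSetKernelArg(kernel, 9, sizeof(int), &i); /* live nodes this step */

        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                               0, NULL, NULL);
        clFinish(queue); /* force each iteration to complete (testing only) */

        cl_mem tmp = optionValues; /* swap buffers for the next iteration */
        optionValues = tempOptionValues;
        tempOptionValues = tmp;
    }
}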

Does anyone have any idea what could be going on? P.S. I can provide my main.cpp file as well if needed. Thanks.

-Matt

#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void InitializeValues(__global float *optionValues, float u, int N)
{
    int globalID = get_global_id(0) + 1;
    float myYValue, myOptionValue;

    if ( globalID <= N )
    {
        myYValue = pow(u, globalID);
        myOptionValue = fmax(myYValue - 1.0f, 0.0f);
        optionValues[globalID] = myOptionValue;
    }

    //if ( globalID == N )
    //    printf("ID: %i\toptVal: %f\n", globalID, optionValues[globalID]);
}

__kernel void LookbackOpenCL(__local float *sharedOptionValues,
                             __global float *optionValues,
                             __global float *tempOptionValues,
                             int N, float u, float d, float pu, float pd,
                             float disc, int i, __global float *result)
{
    int globalID = get_global_id(0);
    int localID = get_local_id(0);
    float myYValue, tempOptionValue;

    if ( globalID < i )
    {
        sharedOptionValues[localID] = optionValues[globalID];

        if ( localID == 255 || globalID == i - 1 )
            sharedOptionValues[localID + 1] = optionValues[globalID + 1];

        myYValue = pow(u, globalID);
    }

    barrier(CLK_LOCAL_MEM_FENCE); // EDIT: added after the first reply below

    if ( globalID < i )
    {
        if ( globalID == 0 )
            tempOptionValue = fmax(myYValue - 1.0f,
                ((pu * sharedOptionValues[1] * d) + (pd * sharedOptionValues[0] * u)) * disc);
        else
            tempOptionValue = fmax(myYValue - 1.0f,
                ((pu * sharedOptionValues[localID + 1] * d) + (pd * optionValues[globalID - 1] * u)) * disc);

        tempOptionValues[globalID] = tempOptionValue;
    }

    if ( globalID == 0 )
        *result = optionValues[0];

    //error here - for some reason, when I don't print the value it comes out incorrectly
    //if I do print, the value is as expected.
    // just a random thread
    //if ( globalID == 0 )
    //    printf("OptionValues[0]: %f\n", optionValues[0]);
}

nou
Exemplar

You are writing to a local array, so you should use barrier(CLK_LOCAL_MEM_FENCE);
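
The general pattern: any write to __local memory that another work-item will read must be separated from that read by a barrier. A generic sketch (hypothetical names, not your kernel; assumes a work-group size of 256 and one extra halo element at the end of in[]):

/* Generic staging pattern (sketch): write to __local, barrier, then read
   a neighbour's element safely. */
__kernel void stage_and_read(__global const float *in, __global float *out)
{
    __local float tile[256 + 1];
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = in[gid];            /* each work-item stages one element */
    if (lid == 0)
        tile[256] = in[gid + 256];  /* halo element for the last work-item */

    barrier(CLK_LOCAL_MEM_FENCE);   /* all local writes visible before reads */

    out[gid] = tile[lid] + tile[lid + 1]; /* now safe to read a neighbour */
}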

0 Likes

notyou

My mistake in the above code: it was an outdated version in which I was solely using global memory (I've edited the post to reflect this). Just a thought, though: I'm using a Mobility Radeon 5870, and 2560 divides evenly by 256 into 10 workgroups, which is the same number of CUs the 5870M has. Is it possible that there are issues scheduling 11 or more workgroups onto the hardware, causing this problem?
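
For what it's worth, the comparison I'm making is just this (a trivial sketch; check_occupancy is a made-up helper name):

#include <CL/cl.h>
#include <stdio.h>

/* Trivial sketch (hypothetical helper): compare the number of workgroups
   a launch creates against the device's compute-unit count. */
void check_occupancy(cl_device_id device, size_t global, size_t local)
{
    cl_uint cus = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(cus), &cus, NULL);
    printf("workgroups: %u, compute units: %u\n",
           (unsigned)(global / local), cus);
}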


himanshu.gautam

Not at all.

OpenCL does not enforce any limit on the global work size. You can have any number of workgroups, and proper scheduling must happen.

Are you referring to the first or the second kernel for the correctness issue? If you still face this issue, please also post the host code so we can try to reproduce it at our end.


notyou

Originally posted by: himanshu.gautam

Not at all.

OpenCL does not enforce any limit on the global work size. You can have any number of workgroups, and proper scheduling must happen.

Are you referring to the first or the second kernel for the correctness issue? If you still face this issue, please also post the host code so we can try to reproduce it at our end.



I didn't think OpenCL had any restrictions on the global size, since I have used larger sizes than this before without issues.

The problem I'm having is with the second kernel. (I should mention that as N increases I lose precision in the first kernel, so I'll look at increasing precision in the future; but that should account for a difference of at most 1-2 in the final value, whereas I'm high by a few thousand.)

Since I have used a number of wrapper functions, I've included everything here and you can pick out what you need. See your PM for the password.

https://rapidshare.com/files/1396036414/Binomial_Lattice.zip

System Info:

Windows 7 x64 Home Premium

MS Visual Studio 2010

SDK 2.5

Driver 11.11 (I tried rolling back to 11.10, which also had this same issue)

i7-720QM

Mobility Radeon 5870


notyou

Just to add: the issue also appears when I run the program on a CPU OpenCL device, and it also happens when I use host memory. I will try running the program on my APU system later to see whether the problem is specific to this machine.


notyou

The issue also shows up on the E-350 APU, except there it occurs once the input size passes 512. Again, strangely, that is 256 times the number of CUs (2). Is it possible that executing more workgroups than there are compute units causes internal scheduling problems, making the threads overwrite data?


notyou

Figured it out: it was the use of local memory causing the errors. I'm not sure why the issue only showed up after creating more workgroups than compute units (overwriting local memory somehow?), but removing local memory and resorting to only using global memory fixed the problem for sizes up to 32768 (verified; checking larger now).
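
For anyone who finds this later, the working update is essentially the following (a trimmed sketch of the idea, not my exact code):

/* Sketch of the fix (trimmed): the update reads optionValues[] directly
   from global memory instead of staging values in __local, so no barrier
   or shared state is involved. */
__kernel void LookbackOpenCL(__global float *optionValues,
                             __global float *tempOptionValues,
                             int N, float u, float d, float pu, float pd,
                             float disc, int i, __global float *result)
{
    int globalID = get_global_id(0);

    if ( globalID < i )
    {
        float myYValue = pow(u, (float)globalID);
        float up = optionValues[globalID + 1];   /* global reads only */
        float down = (globalID == 0) ? optionValues[0]
                                     : optionValues[globalID - 1];

        tempOptionValues[globalID] =
            fmax(myYValue - 1.0f, ((pu * up * d) + (pd * down * u)) * disc);
    }

    if ( globalID == 0 )
        *result = optionValues[0];
}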
