7 Replies Latest reply on Dec 1, 2011 12:14 AM by notyou

    Error unless printing after passing data size???

    notyou

      I realize this sounds incredibly weird, but for some reason, once I pass a certain input size (N = 2561 for this particular example), my returned results are never correct unless I print the value while the kernel is executing.

      In the attached code (a binomial lattice, if it helps), if I run with i = 2561 (the number of steps), the buffer value returned is 0.15..., which is exactly the value I expect. If I step up to 2562, I get a very large number in the thousands, which slowly increases. What I don't understand is that if I then enable the printf in the kernel, the same buffer I retrieve (with absolutely no changes to my main.cpp file) has the correct value, 0.15..., but it takes an incredibly long time to execute because of the printf. If it helps, I am also calling clFinish() after each iteration to force it to update (for testing).

      Does anyone have any idea what could be going on? PS. I can provide the main.cpp file as well if needed. Thanks.

      -Matt

      #pragma OPENCL EXTENSION cl_amd_printf : enable

      __kernel void InitializeValues(__global float *optionValues, float u, int N)
      {
          int globalID = get_global_id(0) + 1;
          float myYValue, myOptionValue;

          if ( globalID <= N )
          {
              myYValue = pow(u, globalID);
              myOptionValue = fmax(myYValue - 1.0f, 0.0f);
              optionValues[globalID] = myOptionValue;
          }

          //if ( globalID == N )
          //    printf("ID: %i\toptVal: %f\n", globalID, optionValues[globalID]);
      }

      __kernel void LookbackOpenCL(__local float *sharedOptionValues,
                                   __global float *optionValues,
                                   __global float *tempOptionValues,
                                   int N, float u, float d, float pu, float pd,
                                   float disc, int i, __global float* result)
      {
          int globalID = get_global_id(0);
          int localID = get_local_id(0);
          float myYValue, tempOptionValue;

          if ( globalID < i )
          {
              sharedOptionValues[localID] = optionValues[globalID];
              if ( localID == 255 || globalID == i - 1 )
                  sharedOptionValues[localID + 1] = optionValues[globalID + 1];
              myYValue = pow(u, globalID);
          }

          // EDIT:
          barrier(CLK_LOCAL_MEM_FENCE);

          if ( globalID < i )
          {
              if ( globalID == 0 )
                  tempOptionValue = fmax(myYValue - 1.0f, ((pu * sharedOptionValues[1] * d) + (pd * sharedOptionValues[0] * u)) * disc);
              else
                  tempOptionValue = fmax(myYValue - 1.0f, ((pu * sharedOptionValues[localID + 1] * d) + (pd * optionValues[globalID - 1] * u)) * disc);
              tempOptionValues[globalID] = tempOptionValue;
          }

          if ( globalID == 0 )
              *result = optionValues[0];

          //error here - for some reason, when I don't print the value it comes out incorrectly
          //if I do print, the value is as expected.

          // just a random thread
          //if ( globalID == 0 )
          //    printf("OptionValues[0]: %f\n", optionValues[0]);
      }

        • Error unless printing after passing data size???
          nou

          You are writing to a local array, so you should use barrier(CLK_LOCAL_MEM_FENCE); before reading it back.

            • Error unless printing after passing data size???
              notyou

              My mistake in the above code: it was an outdated version in which I was using only global memory (now edited to reflect this). Just a thought, though: I'm using a Mobility Radeon 5870, and 2560 divides evenly into groups of 256 to make 10 work-groups, which is the same as the number of CUs on the 5870M. Is it possible that scheduling 11 or more work-groups onto the hardware is causing this problem?

                • Error unless printing after passing data size???
                  himanshu.gautam

                  Not at all.

                  OpenCL does not enforce any limit on the global work size. You can have any number of work-groups, and they must be scheduled correctly.

                  Are you referring to the first or the second kernel for the correctness issue? If you still face this issue, please also post the host code so we can try to reproduce it at our end.
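For reference while checking the host side, here is a minimal sketch of the launch arithmetic, assuming a work-group size of 256 (inferred from the `localID == 255` test in the kernel) and assuming the host passes an explicit local size, in which case OpenCL 1.x requires the global size to be a multiple of it. `LaunchConfig` and `makeLaunchConfig` are hypothetical names for illustration only; the thread's actual main.cpp is not shown.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: launch parameters for one kernel enqueue.
struct LaunchConfig {
    std::size_t globalSize;  // active items rounded up to a multiple of localSize
    std::size_t numGroups;   // number of work-groups that will be launched
    std::size_t localBytes;  // bytes of __local memory for sharedOptionValues
};

// activeItems: number of work-items that pass the "globalID < i" guard.
// localSize:   work-group size (assumed to be 256 in this thread).
LaunchConfig makeLaunchConfig(std::size_t activeItems, std::size_t localSize)
{
    assert(localSize > 0);

    LaunchConfig cfg;
    // Ceiling division: a partially filled last group still counts as a group.
    cfg.numGroups  = (activeItems + localSize - 1) / localSize;
    cfg.globalSize = cfg.numGroups * localSize;
    // The kernel writes sharedOptionValues[localID + 1] for the last item of
    // a group, so the local buffer needs one extra element.
    cfg.localBytes = (localSize + 1) * sizeof(float);
    return cfg;
}
```

With these assumptions, 2560 active items give exactly 10 work-groups, while 2562 give 11 work-groups and a padded global size of 2816; the padding items are the ones the `globalID < i` guard is meant to exclude. `globalSize` would be passed to clEnqueueNDRangeKernel, and `localBytes` (with a NULL pointer) to clSetKernelArg for the `__local` argument.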

                    • Error unless printing after passing data size???
                      notyou

                      Originally posted by: himanshu.gautam

                      Not at all. OpenCL does not enforce any limit on the global work size. You can have any number of work-groups, and they must be scheduled correctly.

                      Are you referring to the first or the second kernel for the correctness issue? If you still face this issue, please also post the host code so we can try to reproduce it at our end.



                      I didn't think OpenCL had any restrictions on the global size since I have used larger groupings than this before without issues.

                      The problem I'm having is with the second kernel. I should mention that as N increases I lose precision in the first kernel (so I'll look at increasing precision in the future), but this should make a difference of at most 1-2 in the value by the end; instead, I'm high by a few thousand.
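To put a rough number on the precision loss mentioned above, here is a small host-side sketch (C++11, not part of the original project) that measures the gap between adjacent float values and compares a single-precision power with a double-precision reference. `floatSpacing` and `powAbsError` are hypothetical helper names, and the device's `pow` may round differently from the host's, so treat the result only as an estimate.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next larger representable float.
// The gap grows with the magnitude of x, so large pow(u, N) results
// carry a larger absolute error than small ones.
float floatSpacing(float x)
{
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// Absolute difference between u^n evaluated in single precision and the
// same power evaluated in double precision (used here as the reference).
double powAbsError(float u, int n)
{
    // Casting n to float selects the float overload of std::pow.
    float single = std::pow(u, static_cast<float>(n));
    double reference = std::pow(static_cast<double>(u), static_cast<double>(n));
    return std::fabs(static_cast<double>(single) - reference);
}
```

For example, `floatSpacing(16777216.0f)` is 2, so once a value exceeds 2^24 a float can no longer represent every integer; calling `powAbsError` with the actual u and N would show whether the first kernel's error is in the expected range of 1-2.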

                      Since I have used a number of wrapper functions, I've included everything here and you can pick out what you need. See your PM for the password.

                      https://rapidshare.com/files/1396036414/Binomial_Lattice.zip

                      System Info:

                      Windows 7 x64 Home Premium

                      MS Visual Studio 2010

                      SDK 2.5

                      Driver 11.11 (I tried rolling back to 11.10 which also had this same issue)

                      i7-720QM

                      Mobility Radeon 5870