6 Replies Latest reply on Nov 11, 2009 10:36 PM by Stib

    Disturbing numbers

    Stib
      Some VERY disturbing numbers...

      I ran the same kernel with two different work-group settings on my CPU and GPU, 10 times each. The output I get is wrong. Very wrong!

      Is my hardware broken, or what could be the problem??

      The kernel:

          __kernel void hello(__global int *out)
          {
              out[0] = get_global_id(0);
              out[1] = get_local_id(0);
              out[2] = get_global_id(1);
              out[3] = get_local_id(1);
              out[4] = get_global_id(2);
              out[5] = get_local_id(2);
              out[6] += 1;
          }

      ...and my full output, with every detail:

      Available platforms: 1
      Platform profile: FULL_PROFILE
      Platform version: OpenCL 1.0 ATI-Stream-v2.0-beta4
      Platform name: ATI Stream
      Platform vendor: Advanced Micro Devices, Inc.
      Platform extensions:
      Number of devices: 2

      Device #1: CPU
        Vendor ID: 4098
        Max compute units: 2
        Max work item dimensions: 3
        Max work item sizes: ( 1024, 1024, 1024 )
        Max work group size: 1024
        Max clock requency: 3005 MHz
        Address bits: 64
        Max mem alloc size: 1024 MB
        Image support: FALSE
        Max parameter size: 4096 byte
        Mem base addr align: 1024 bit
        Min data type align size: 128 byte
        Denorms supported: TRUE
        INF and quiet NaNs supported: TRUE
        Round to nearest supported: TRUE
        Round to zero supported: FALSE
        Round to +/- inf supported: FALSE
        IEEE754-2008 FMA supported: FALSE
        Global mem cache type: Read and Write
        Size of global mem cache line: 64 byte
        Size of global mem cache: 64 KB
        Global mem size: 3072 MB
        Max size of a const buff alloc: 64 KB
        Max number of const args: 8
        Local memory type: Global
        Local memory size: 32 KB
        Error correction support: FALSE
        Resolution of device timer: 1 nanosec
        Litle endian device: TRUE
        Device available: TRUE
        Compiler available: TRUE
        Can execute OpenCL kernels: TRUE
        Can execute native kernels: FALSE
        Out of order exec enabled: FALSE
        Queue profiling enabled: TRUE
        Associated platform: ATI Stream
        Device name: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
        Vendor: GenuineIntel
        Driver version: 1.0
        Supported profile: FULL_PROFILE
        Supported OpenCL version: OpenCL 1.0 ATI-Stream-v2.0-beta4
        Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store
        Context reference count: 1

      Device #2: GPU
        Vendor ID: 4098
        Max compute units: 10
        Max work item dimensions: 3
        Max work item sizes: ( 256, 256, 256 )
        Max work group size: 256
        Max clock requency: 625 MHz
        Address bits: 32
        Max mem alloc size: 128 MB
        Image support: FALSE
        Max parameter size: 1024 byte
        Mem base addr align: 32768 bit
        Min data type align size: 128 byte
        Denorms supported: FALSE
        INF and quiet NaNs supported: TRUE
        Round to nearest supported: TRUE
        Round to zero supported: FALSE
        Round to +/- inf supported: FALSE
        IEEE754-2008 FMA supported: FALSE
        Global mem cache type: None
        Size of global mem cache line: 0 byte
        Size of global mem cache: 0 KB
        Global mem size: 128 MB
        Max size of a const buff alloc: 64 KB
        Max number of const args: 8
        Local memory type: Global
        Local memory size: 16 KB
        Error correction support: FALSE
        Resolution of device timer: 1 nanosec
        Litle endian device: TRUE
        Device available: TRUE
        Compiler available: TRUE
        Can execute OpenCL kernels: TRUE
        Can execute native kernels: FALSE
        Out of order exec enabled: FALSE
        Queue profiling enabled: TRUE
        Associated platform: ATI Stream
        Device name: ATI RV770
        Vendor: Advanced Micro Devices, Inc.
        Driver version: CAL 1.4.467
        Supported profile: FULL_PROFILE
        Supported OpenCL version: OpenCL 1.0 ATI-Stream-v2.0-beta4
        Extensions:
        Context reference count: 1

      Loading and compiling CL source
      Kernel Function name: hello
      Number of kernel args: 1
      Kernel reference count: 1

      #1 -------------------------------------------------------
      Device #1   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 11   localID dim1: 5
        globalID dim2: 11   localID dim2: 5
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 1164 ?
        'hello' run time: 2.9981e-005 sec
      Device #2   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 9    localID dim1: 3
        globalID dim2: 10   localID dim2: 4
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 8 ?
        'hello' run time: 4.2464e-005 sec
      Device #1   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 5   localID dim1: 5
        globalID dim2: 5   localID dim2: 5
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 216 ?
        'hello' run time: 5.111e-006 sec
      Device #2   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 3   localID dim1: 3
        globalID dim2: 4   localID dim2: 4
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 1 ?
        'hello' run time: 2.6532e-005 sec

      #2 -------------------------------------------------------
      Device #1   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 5    localID dim1: 5
        globalID dim2: 11   localID dim2: 5
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 1255 ?
        'hello' run time: 2.7256e-005 sec
      Device #2   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 9    localID dim1: 3
        globalID dim2: 10   localID dim2: 4
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 8 ?
        'hello' run time: 4.2752e-005 sec
      Device #1   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 5   localID dim1: 5
        globalID dim2: 5   localID dim2: 5
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 216 ?
        'hello' run time: 3.748e-006 sec
      Device #2   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 3   localID dim1: 3
        globalID dim2: 4   localID dim2: 4
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 1 ?
        'hello' run time: 2.6604e-005 sec

      #3 -------------------------------------------------------
      Device #1   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 11   localID dim1: 5
        globalID dim2: 11   localID dim2: 5
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 1091 ?
        'hello' run time: 2.6233e-005 sec
      Device #2   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 9    localID dim1: 3
        globalID dim2: 10   localID dim2: 4
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 7 ?
        'hello' run time: 4.2707e-005 sec
      Device #1   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 5   localID dim1: 5
        globalID dim2: 5   localID dim2: 5
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 216 ?
        'hello' run time: 5.451e-006 sec
      Device #2   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 3   localID dim1: 3
        globalID dim2: 4   localID dim2: 4
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 1 ?
        'hello' run time: 2.6561e-005 sec

      #4 -------------------------------------------------------
      Device #1   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 11   localID dim1: 5
        globalID dim2: 11   localID dim2: 5
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 1046 ?
        'hello' run time: 2.6575e-005 sec
      Device #2   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 9    localID dim1: 3
        globalID dim2: 10   localID dim2: 4
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 8 ?
        'hello' run time: 4.2721e-005 sec
      Device #1   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 5   localID dim1: 5
        globalID dim2: 5   localID dim2: 5
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 216 ?
        'hello' run time: 5.452e-006 sec
      Device #2   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 3   localID dim1: 3
        globalID dim2: 4   localID dim2: 4
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 1 ?
        'hello' run time: 2.663e-005 sec

      #5 -------------------------------------------------------
      Device #1   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 11   localID dim1: 5
        globalID dim2: 11   localID dim2: 5
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 1194 ?
        'hello' run time: 2.4871e-005 sec
      Device #2   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 9    localID dim1: 3
        globalID dim2: 10   localID dim2: 4
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 8 ?
        'hello' run time: 4.267e-005 sec
      Device #1   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 5   localID dim1: 5
        globalID dim2: 5   localID dim2: 5
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 216 ?
        'hello' run time: 5.11e-006 sec
      Device #2   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 3   localID dim1: 3
        globalID dim2: 4   localID dim2: 4
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 1 ?
        'hello' run time: 2.6582e-005 sec

      #6 -------------------------------------------------------
      Device #1   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 5    localID dim1: 5
        globalID dim2: 11   localID dim2: 5
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 1209 ?
        'hello' run time: 2.8278e-005 sec
      Device #2   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 9    localID dim1: 3
        globalID dim2: 10   localID dim2: 4
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 8 ?
        'hello' run time: 4.2902e-005 sec
      Device #1   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 5   localID dim1: 5
        globalID dim2: 5   localID dim2: 5
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 216 ?
        'hello' run time: 5.11e-006 sec
      Device #2   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 3   localID dim1: 3
        globalID dim2: 4   localID dim2: 4
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 1 ?
        'hello' run time: 2.6563e-005 sec

      #7 -------------------------------------------------------
      Device #1   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 11   localID dim1: 5
        globalID dim2: 11   localID dim2: 5
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 1031 ?
        'hello' run time: 2.5893e-005 sec
      Device #2   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 9    localID dim1: 3
        globalID dim2: 10   localID dim2: 4
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 8 ?
        'hello' run time: 4.2632e-005 sec
      Device #1   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 5   localID dim1: 5
        globalID dim2: 5   localID dim2: 5
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 216 ?
        'hello' run time: 4.089e-006 sec
      Device #2   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 3   localID dim1: 3
        globalID dim2: 4   localID dim2: 4
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 1 ?
        'hello' run time: 2.6476e-005 sec

      #8 -------------------------------------------------------
      Device #1   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 11   localID dim1: 5
        globalID dim2: 11   localID dim2: 5
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 1073 ?
        'hello' run time: 2.5552e-005 sec
      Device #2   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 9    localID dim1: 3
        globalID dim2: 10   localID dim2: 4
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 8 ?
        'hello' run time: 4.2632e-005 sec
      Device #1   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 5   localID dim1: 5
        globalID dim2: 5   localID dim2: 5
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 216 ?
        'hello' run time: 6.132e-006 sec
      Device #2   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 3   localID dim1: 3
        globalID dim2: 4   localID dim2: 4
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 1 ?
        'hello' run time: 2.6524e-005 sec

      #9 -------------------------------------------------------
      Device #1   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 5    localID dim1: 5
        globalID dim2: 11   localID dim2: 5
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 1248 ?
        'hello' run time: 2.8959e-005 sec
      Device #2   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 9    localID dim1: 3
        globalID dim2: 10   localID dim2: 4
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 8 ?
        'hello' run time: 4.2656e-005 sec
      Device #1   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 5   localID dim1: 5
        globalID dim2: 5   localID dim2: 5
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 216 ?
        'hello' run time: 5.451e-006 sec
      Device #2   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 3   localID dim1: 3
        globalID dim2: 4   localID dim2: 4
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 1 ?
        'hello' run time: 2.6531e-005 sec

      #10 -------------------------------------------------------
      Device #1   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 11   localID dim1: 5
        globalID dim2: 11   localID dim2: 5
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 1150 ?
        'hello' run time: 2.5552e-005 sec
      Device #2   global size: ( 12, 12, 12 )   local size: ( 6, 6, 6 )
        globalID dim1: 9    localID dim1: 3
        globalID dim2: 10   localID dim2: 4
        globalID dim3: 11   localID dim3: 5
        number of iterations (12^3): 8 ?
        'hello' run time: 4.2712e-005 sec
      Device #1   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 5   localID dim1: 5
        globalID dim2: 5   localID dim2: 5
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 216 ?
        'hello' run time: 5.111e-006 sec
      Device #2   global size: ( 6, 6, 6 )   local size: ( 6, 6, 6 )
        globalID dim1: 3   localID dim1: 3
        globalID dim2: 4   localID dim2: 4
        globalID dim3: 5   localID dim3: 5
        number of iterations (6^3): 1 ?
        'hello' run time: 2.6555e-005 sec

      Done Passed!

        • Disturbing numbers
          nou

          No, your hardware is fine; your code is broken. In your kernel you write values to the same location from multiple kernel instances. You should do something like this:

          out[get_global_id(0)*7] =

          out[get_global_id(0)*7+1] =

          out[get_global_id(0)*7+2] =

          out[get_global_id(0)*7+3] =

          etc.
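One caveat with indexing by get_global_id(0) alone: in a 3D range such as ( 12, 12, 12 ), 144 work-items share each value of get_global_id(0), so they would still collide. A minimal sketch of the per-work-item layout follows, written as plain host-side C that loops over the index space in place of the OpenCL runtime. The names `linear_id`, `fill_slots` and `FIELDS` are hypothetical and only for illustration; local IDs are computed as global ID modulo work-group size, which assumes no global offset.

```c
#include <assert.h>
#include <stddef.h>

enum { FIELDS = 7 };  /* ints stored per work-item, as in the original kernel */

/* Flatten a 3D global ID into one unique index (dimension 0 varies fastest). */
size_t linear_id(size_t gx, size_t gy, size_t gz, size_t nx, size_t ny)
{
    return (gz * ny + gy) * nx + gx;
}

/* Emulate the index space: every work-item writes only to its own record.
 * out must hold nx*ny*nz*FIELDS ints; lx, ly, lz are the work-group sizes. */
void fill_slots(int *out, size_t nx, size_t ny, size_t nz,
                size_t lx, size_t ly, size_t lz)
{
    /* Work-group sizes must be non-zero and divide the global sizes. */
    assert(lx && ly && lz && nx % lx == 0 && ny % ly == 0 && nz % lz == 0);
    for (size_t gz = 0; gz < nz; ++gz)
        for (size_t gy = 0; gy < ny; ++gy)
            for (size_t gx = 0; gx < nx; ++gx) {
                int *rec = out + linear_id(gx, gy, gz, nx, ny) * FIELDS;
                rec[0] = (int)gx;  rec[1] = (int)(gx % lx);
                rec[2] = (int)gy;  rec[3] = (int)(gy % ly);
                rec[4] = (int)gz;  rec[5] = (int)(gz % lz);
                rec[6] = 1;        /* this work-item ran once */
            }
}
```

In a kernel, the same index would be built from get_global_id() and get_global_size(); summing field 6 over all records on the host then gives the number of work-items that ran.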

            • Disturbing numbers
              Stib

              I wanted to write to the same place. In the output I should see the last IDs and, because of the increment in out[6], the number of iterations.

              But the last IDs do not seem to be correct. My command queue is not in "out of order execution" mode, so the last IDs written should be the last ones in the ID space. The iteration count does not seem to be correct either. If I run a kernel with ( 6, 6, 6 ) => 6^3 times, it should be 216! Not 1 or 8. Likewise, a global work size of ( 12, 12, 12 ) => 12^3 should give 1728! Here the CPU is way off too!

              So have I misunderstood something, or what's the problem??

              But the biggest problem is that the output is not only wrong, it is also different each time I run the same code!! I'm sure it cannot be a memory problem like before; there are no "wild" pointers in my code now.
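Two points may help here. First, the in-order setting of a command queue orders the commands in the queue (for example, whole kernel launches), not the work-items inside one launch. Second, `out[6] += 1` is a read, an add and a write, so work-items that read before another one has written back all store the same value. The sketch below is a toy model in plain C, not a description of how any real device schedules work: it assumes work-items run in lockstep batches of `batch` items, with every item in a batch reading the counter before any of them writes. The name `racy_count` is hypothetical.

```c
#include <assert.h>

/* Toy model (an assumption for illustration only): work-items run in
 * lockstep batches. Every item in a batch first reads the counter, and
 * only then does every item write back "the value it read + 1". */
int racy_count(int items, int batch)
{
    assert(items >= 0 && batch > 0);
    int counter = 0;
    for (int start = 0; start < items; start += batch) {
        int seen = counter;  /* all items in this batch read the same value */
        int n = (items - start < batch) ? items - start : batch;
        for (int i = 0; i < n; ++i)
            counter = seen + 1;  /* each write overwrites the previous one */
    }
    return counter;
}
```

Under this model, `racy_count(216, 216)` returns 1 and `racy_count(1728, 216)` returns 8, which happens to match the GPU numbers above if each work-group of 216 items behaves like one batch. Only `racy_count(n, 1)`, a fully serial run, returns n. The CPU counts (1031 to 1255) would correspond to a less regular interleaving, which is also why they change from run to run.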

                • Disturbing numbers
                  nou

                  Kernel instances are supposed to run at the same time. So when you run with a global work size of ( 12, 12, 12 ), you are in fact launching 1728 parallel threads that run at the same time.

                  Read this: http://en.wikipedia.org/wiki/Critical_section

                  In fact, OpenCL has no locks or semaphores. There are only synchronization points between the threads of a local work-group, and atomic operations (which are currently supported only on the CPU and GeForce; support is coming for Radeon 5xxx, but not 4xxx).

                  Read only: many threads can read from the same place.

                  Read and/or write: each thread needs its own place to write (and read).

                    • Disturbing numbers
                      Stib

                      OK, I think... I get it. Somehow. So with a 5xxx card, could I do what I wanted to do here?

                        • Disturbing numbers
                          nou

                          Yes, with atomic operations (which are supported on the CPU now and will be on 5xxx). But with them you lose all the parallelism. I recommend reading section 3 of the OpenCL specification and some material on parallel programming.
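A common way to count without funnelling every work-item through one atomic counter is to count in two stages: each work-group produces a partial result in its own slot, and the host adds the slots up afterwards. The sketch below emulates that in plain C. The name `count_by_group` is hypothetical, and the inner loop only stands in for whatever the work-items of one group would do together (on a device, typically a barrier-synchronised reduction in local memory, since the items of a group also run concurrently).

```c
#include <assert.h>
#include <stddef.h>

/* Two-stage count: one partial result per work-group, summed on the host.
 * partial must hold `groups` ints. Returns the total number of items. */
int count_by_group(int *partial, size_t groups, size_t items_per_group)
{
    assert(partial != NULL && groups > 0);

    /* Stage 1: groups are independent, so on a device they could run in
     * parallel. The inner loop stands in for a per-group reduction. */
    for (size_t g = 0; g < groups; ++g) {
        partial[g] = 0;
        for (size_t i = 0; i < items_per_group; ++i)
            partial[g] += 1;  /* slot g is touched by group g only */
    }

    /* Stage 2: the host sums the per-group slots sequentially. */
    int total = 0;
    for (size_t g = 0; g < groups; ++g)
        total += partial[g];
    return total;
}
```

For the ( 12, 12, 12 ) range with ( 6, 6, 6 ) work-groups this would be called with 8 groups of 216 items, giving 1728; no two groups ever write to the same location.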

                            • Disturbing numbers
                              Stib

                              Yeah, I think I'm slowly getting it. When I started this about two weeks ago, I knew nothing about parallel programming, but I'm slowly getting a feel for it.

                              This forum helps me a lot, with everything!

                              *beer*