18 Replies Latest reply on Jan 9, 2013 12:02 AM by heman

    incorrect result on GPU

    nou

      I have a problem: my code gives the right results on the CPU, but running it on the GPU returns random results.

      I have pinpointed the problem to this function: the event_vector array does not contain the proper values.

      you can find whole code here http://orwell.fiit.stuba.sk/git?p=cellula.git;a=blob;f=kernels/asyn.cl;h=95f69b0872a356b035bc6200851cd7da1bdc3713;hb=da36c88bdc666c12c8fe5131a6c59ca8e48eae30

      #define NEIGHBOR 8
      #define QUEUE_LEN 8
      #define COORD(c) (c.y*get_global_size(0) + c.x)

      typedef float _time;
      typedef float _state;

      typedef struct {
          _time x;
          _state y;
      } _event;

      typedef struct {
          int start, end;
          int start2, end2;
          _event state;
          _event queue[QUEUE_LEN];
      } _cell;

      void create_event_vector(__global _cell *cells, _event *event_vector,
                               int2 *coord, int *index)
      {
          for (int i = 0; i < NEIGHBOR; i++)
              event_vector[i] = cells[COORD(coord[i])].queue[index[i]];
      }

        • incorrect result on GPU
          jhabig

          Hi nou!

          I just saw your topic but didn't have the time to go through the code. I had the same behaviour, and it all came down to the following.

          My buffer had the flag CL_MEM_USE_HOST_PTR. On the CPU it turned out that the CL kernel simply used the bit of memory my host buffer was defined on. When using the GPU, the host buffer and the kernel memory are not the same, and I had to call clEnqueueReadBuffer to get the data. That call needs to be made in blocking mode, or you need to wait on an event before using the results.
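In host code, the blocking read-back looks roughly like this sketch (the names queue, buf, host_ptr, and nbytes are assumptions, not taken from the thread):

```
/* Sketch: with CL_MEM_USE_HOST_PTR the GPU may work on its own copy,
 * so the results must be read back explicitly. CL_TRUE makes the call
 * block until host_ptr holds valid data. */
cl_int err = clEnqueueReadBuffer(queue, buf,
                                 CL_TRUE,        /* blocking read  */
                                 0, nbytes,      /* offset, size   */
                                 host_ptr,
                                 0, NULL, NULL); /* no event deps  */
```

Alternatively, pass CL_FALSE and a cl_event, and wait on that event before touching the results.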

          I hope that helps, but you may have already thought of this. As I said above, I haven't had the time to look into your code yet.

          • incorrect result on GPU
            MicahVillmow
            Nou,
            There are issues with structures in private/local memory in SDK 2.01. This should be fixed in the upcoming release. However, as an optimization, the approach you are taking for passing arguments to a kernel is not optimal.
            This:
            __kernel void takeStep(__global _cell *cells, _time start_time, _time time_window, __global float16 *debug)
            Will perform a lot better if you do this:
            __kernel void takeStep(__global _cell *cells, __constant _time* start_time, __constant _time* time_window, __global float16 *debug)

            The reason is that private memory is mapped to the hardware private memory when it is dynamically indexed or when a structure is passed by value to a kernel. The hardware private memory is stored in on-device RAM and is uncached. If you turn the pass by value into pass by pointer in the constant address space, the structures are then loaded into the constant cache, which at peak rate is about 3/5ths the speed of register access.
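Host-side, switching a by-value scalar to a __constant pointer also changes how the argument is set; a sketch under assumed names (context, kernel, start_time, err):

```
/* The scalar now lives in a small read-only buffer instead of being
 * copied by value into the kernel argument. */
cl_mem start_time_buf = clCreateBuffer(context,
        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        sizeof(cl_float), &start_time, &err);

/* before: clSetKernelArg(kernel, 1, sizeof(cl_float), &start_time); */
clSetKernelArg(kernel, 1, sizeof(cl_mem), &start_time_buf);
```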

            There is also a feature being introduced in our upcoming release that will allow more efficient access into the constant buffer with the attribute max_constant_size(#Bytes).

            So this kernel would be something like:
            __kernel void takeStep(__global _cell *cells,
                __constant _time* start_time __attribute__((max_constant_size(sizeof(_time)))),
                __constant _time* time_window __attribute__((max_constant_size(sizeof(_time)))),
                __global float16 *debug)

            What this will do is pack both start_time and time_window into the same constant cache for better resource management, as only 2 constant caches can be used at a time in a single ALU CF clause.

            I hope this helps. If there is still an issue with this kernel after the upcoming release, please let me know.
              • incorrect result on GPU
                nou

                Thanks for the reply, Micah. As you suggested, I tried replacing my _event with float2 (in the future I need to choose the types in these structures dynamically, e.g. int,float or int,float,int), but with that replacement it now freezes the whole system. I removed the while(1) loop just to be sure, but it still does not work. On the CPU it still works normally.

                So I am really looking forward to the next release.

              • incorrect result on GPU
                MicahVillmow
                nou,
                A workaround until the next release is to use __attribute__((reqd_work_group_size(64, 1, 1))) to force smaller work-groups at compile time. The freeze is most likely caused by registers spilling to scratch memory and the loop counter getting corrupted.
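For placement, the attribute goes on the kernel definition; a sketch using the takeStep signature from earlier in the thread (the standard OpenCL C spelling of this attribute is reqd_work_group_size):

```
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void takeStep(__global _cell *cells, _time start_time,
              _time time_window, __global float16 *debug)
{
    /* body unchanged; the attribute fixes the work-group size
       at 64 at compile time, limiting register pressure */
}
```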
                • incorrect result on GPU
                  MicahVillmow
                  nou,
                  Also, the environment variable GPU_MAX_WORKGROUP_SIZE=64 should achieve the same behavior.
                  This should allow you to continue developing until the release comes out.
                  • incorrect result on GPU
                    MicahVillmow
                    nou,
                    The problem is most likely in the codegen and how it handles private memory and structures. Try passing everything in global memory to get around a bug in our SW stack. There should be no problem in our upcoming release as I've spent a lot of time making sure this path works.
                      • incorrect result on GPU
                        nou

                        Today I tried my code on an nVidia card, and it works after some modifications.

                        nVidia does not like this: __constant int2 neighbors[] = {-1,1, 1,1, -1,-1, 1,-1, 0,1, -1,0, 1,0, 0,-1};

                        It refuses to compile, with some strange build error. So I switched from int2 to int; then it worked.

                      • incorrect result on GPU
                        MicahVillmow
                        nou,
                        the correct way to do that is the following:
                        __constant int2 neighbors[] = {(int2)(-1,1), (int2)(1,1), (int2)(-1,-1), (int2)(1,-1), (int2)(0,1), (int2)(-1,0), (int2)(1,0), (int2)(0,-1)};
                        • Re: incorrect result on GPU
                          faruk123

                          Hi

                          The 3DMark 11 results section appears to display my GPU core clock speed incorrectly. Does this mean that 3DMark 11 is running my GPU at that clock speed, or is it just being displayed wrongly? My cards are 3x XFX 6970 in CrossFire; the stock clock speed is supposed to be 880 MHz, and I have in fact overclocked them to 997 MHz.