13 Replies Latest reply on Apr 28, 2013 5:25 PM by Biaowang

    Why the same code running on linux and windows differs so much (2X)

    Biaowang

      Hey:

      I have just run my OpenCL kernel on both Windows 7 and Kubuntu 12.10, built as an autotools project. The project was developed under Linux and then ported to Windows using MinGW64 + MSYS. My platform is a Samsung 535U3C laptop equipped with an A6-4455M (Trinity APU; the GPU part is an HD 7500G). I measure kernel time with events through the OpenCL API clGetEventProfilingInfo. However, the same code shows a huge performance difference: the kernel runs about 2 times faster under Kubuntu than under Windows 7 + MinGW64.
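For readers comparing such measurements, here is a minimal sketch of the arithmetic behind an event-based timing. It assumes the start and end counters come from clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, which the OpenCL specification defines in nanoseconds, on a queue created with CL_QUEUE_PROFILING_ENABLE. The helper names and all counter values below are hypothetical placeholders, not measurements from this thread.

```python
from statistics import mean

NS_PER_MS = 1_000_000  # profiling counters are reported in nanoseconds


def kernel_ms(start_ns, end_ns):
    """Elapsed kernel time in milliseconds from two profiling counters."""
    if end_ns < start_ns:
        raise ValueError("end counter is earlier than start counter")
    return (end_ns - start_ns) / NS_PER_MS


def mean_kernel_ms(samples):
    """Average over several (start_ns, end_ns) pairs to smooth out noise."""
    return mean(kernel_ms(start, end) for start, end in samples)


def slowdown(slow_ms, fast_ms):
    """How many times slower one platform is than the other."""
    return slow_ms / fast_ms


# Hypothetical counters, NOT values from this thread.
linux_runs = [(1_000, 2_001_000), (5_000, 3_005_000)]
windows_runs = [(2_000, 4_002_000), (9_000, 6_009_000)]

linux_ms = mean_kernel_ms(linux_runs)
windows_ms = mean_kernel_ms(windows_runs)
print(f"Linux {linux_ms} ms, Windows {windows_ms} ms, "
      f"ratio {slowdown(windows_ms, linux_ms)}")
```

Averaging several launches and discarding the first (warm-up) run usually gives a fairer comparison than a single measurement.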

       

       

      What I want to do is eliminate the memory transfer between CPU and GPU on this integrated chip, namely zero copy. Unfortunately, this feature is only available under Windows according to Table 4.2 of AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide4.pdf.

      So now I face an awkward situation: even if the copy time between CPU and GPU can be reduced to zero, I suffer a two times slower kernel under Windows!

      Any ideas why the kernel would be so slow under Windows?

      Best

        • Why the same code running on linux and windows differs so much (2X)
          nou

          Reads and writes to zero-copy memory must go through the PCIe bus, so kernels can be slower.

          • Re: Why the same code running on linux and windows differs so much (2X)
            Biaowang

            Gautam's reply:

            Windows + MinGW is a simulated Linux environment, so I would expect it to be slower than Linux itself. Also, the performance of a sample may depend on the problem size you run it with and on the system configuration you are testing on. Are you using the same system to run the sample in Kubuntu and in Win7 + MinGW? Try increasing the problem size; by default the samples run with a small size in order to execute quickly.

            Dear Gautam:

            I had the same suspicion that this results from the "simulated" environment, so I ran my program from the Windows "cmd" terminal, and the results are still the same. So how could I compare the two executables?

            • Re: Why the same code running on linux and windows differs so much (2X)
              himanshu.gautam

              Can you post the "clinfo" output from both the Windows and Linux environments? Look for the clock speed in it; that might give some insight.

              Also, post the logs related to the performance of the application (for Windows and Linux). Is it your own application or an SDK sample?

               


                • Re: Why the same code running on linux and windows differs so much (2X)
                  Biaowang

                  Dear Himanshu:

                   

                  Sorry for the delay. I have collected the clinfo output under both Windows and Linux. The following differences are shown by the command "diff A6-4455MLinux_clinfo.txt A6-4455MWin_clinfo.txt" (lines marked < are from Linux, lines marked > are from Windows):

                  --------------------------------------------------------------------------------------------------I am the boundary--------------------------------------------------------------------------------------------------
                  3c3
                  <   Platform Version:               OpenCL 1.2 AMD-APP (923.1)
                  ---
                  >   Platform Version:               OpenCL 1.2 AMD-APP (1124.2)
                  6c6
                  <   Platform Extensions:            cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
                  ---
                  >   Platform Extensions:            cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing cl_khr_d3d11_sharing
                  14d13
                  <   Device Topology:                PCI[ B#0, D#1, F#0 ]
                  35c34
                  <   Max memory allocation:          134217728
                  ---
                  >   Max memory allocation:          200540160
                  39,40c38,39
                  <   Max image 2D width:             8192
                  <   Max image 2D height:            8192
                  ---
                  >   Max image 2D width:             16384
                  >   Max image 2D height:            16384
                  58c57
                  <   Global memory size:             268435456
                  ---
                  >   Global memory size:             536870912
                  76c75
                  <   Platform ID:                    0x00007f60fdd1c140
                  ---
                  >   Platform ID:                    000007FEEBA62FF8
                  80c79
                  <   Driver version:                 CAL 1.4.1741
                  ---
                  >   Driver version:                 1124.2 (VM)
                  82,83c81,82
                  <   Version:                        OpenCL 1.2 AMD-APP (923.1)
                  <   Extensions:                     cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt
                  ---
                  >   Version:                        OpenCL 1.2 AMD-APP (1124.2)
                  >   Extensions:                     cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing
                  99,100c98,99
                  <   Preferred vector width float:   4
                  <   Preferred vector width double:  0
                  ---
                  >   Preferred vector width float:   8
                  >   Preferred vector width double:  4
                  105,107c104,106
                  <   Native vector width float:      4
                  <   Native vector width double:     0
                  <   Max clock frequency:            1300Mhz
                  ---
                  >   Native vector width float:      8
                  >   Native vector width double:     4
                  >   Max clock frequency:            2096Mhz
                  132c131
                  <   Global memory size:             7807614976
                  ---
                  >   Global memory size:             8014217216
                  140c139
                  <   Profiling timer resolution:     1
                  ---
                  >   Profiling timer resolution:     488
                  150c149
                  <   Platform ID:                    0x00007f60fdd1c140
                  ---
                  >   Platform ID:                    000007FEEBA62FF8
                  154c153
                  <   Driver version:                 2.0 (sse2,avx,fma4)
                  ---
                  >   Driver version:                 1124.2 (sse2,avx,fma4)
                  156,157c155,156
                  <   Version:                        OpenCL 1.2 AMD-APP (923.1)
                  <   Extensions:                     cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt
                  ---
                  >   Version:                        OpenCL 1.2 AMD-APP (1124.2)
                  >   Extensions:                     cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing
                  --------------------------------------------------------------------------------------------------I am the boundary--------------------------------------------------------------------------------------------------

                   

                  I highlighted the differences in bold and italic. I suspect the difference in Max clock frequency may be the cause of my problem (no zero copy, just the same code running on the GPU, an HD 7500G). However, GPU-Z reports the maximum clock frequency of my GPU as neither 1300 MHz nor 2096 MHz, but 423 MHz. Any ideas?

                  BTW, what does "VM" stand for in the driver version?
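A raw diff of two clinfo dumps mixes fields from every device, so it can help to group values by field name. The sketch below is one hedged way to do that in Python. It assumes each clinfo line has the form "Key: value" and that a key repeated in a dump belongs to successive devices in the order they are listed. The function names are hypothetical, and the sample text is only a few lines taken from the diff above, not a complete dump. Judging by the diff line numbers, the Max clock frequency and Profiling timer resolution hunks (lines 105 to 140) fall between the two Driver version lines (80 and 154), the second of which lists sse2,avx,fma4, so they may describe the CPU device rather than the HD 7500G. That is an inference from the diff, not something confirmed in this thread.

```python
from collections import defaultdict


def parse_clinfo(text):
    """Map each field name to the list of its values, in order of appearance."""
    fields = defaultdict(list)
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Keep only "Key: value" lines that actually carry a value.
        if sep and key.strip() and value.strip():
            fields[key.strip()].append(value.strip())
    return dict(fields)


def differing_fields(linux, windows):
    """Return {field: (linux_values, windows_values)} for fields that differ."""
    keys = sorted(set(linux) | set(windows))
    return {
        key: (linux.get(key, []), windows.get(key, []))
        for key in keys
        if linux.get(key, []) != windows.get(key, [])
    }


# Excerpts copied from the diff in this thread (not full clinfo output).
linux_text = """\
  Global memory size:      268435456
  Max clock frequency:     1300Mhz
  Global memory size:      7807614976
"""
windows_text = """\
  Global memory size:      536870912
  Max clock frequency:     2096Mhz
  Global memory size:      8014217216
"""

changes = differing_fields(parse_clinfo(linux_text), parse_clinfo(windows_text))
for field, (linux_values, windows_values) in changes.items():
    print(f"{field}: Linux={linux_values} Windows={windows_values}")
```

With full dumps, the position of a value in each list shows which device it belongs to, which makes it easier to check whether a clock value applies to the GPU at all.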

                • Re: Why the same code running on linux and windows differs so much (2X)
                  Biaowang

                  Just an update on my effort to figure out what is happening.

                  I have both Kubuntu 12.10 and Windows 7 installed on my laptop.

                  My host code supports offline compilation of the kernel code.

                  So I copied the kernel binary compiled under Kubuntu to Windows 7 and ran my kernel from that precompiled binary.

                  The performance is still 2 times slower than under Kubuntu.

                  So I suspect a driver problem: my driver under Windows 7 is OEM-specific, and the AMD verification tool for the latest driver does not verify it successfully.

                  My platform is a Samsung NP530U3C ultrabook.