
Biaowang
Adept II

Why does the same code running on Linux and Windows differ so much (2x)?

Hey:

I have just run my OpenCL kernel on both Windows 7 and Kubuntu 12.10, built as an autotools project. The project was developed under Linux and then ported to Windows using MinGW64 + MSYS. My platform is a Samsung 535U3C laptop equipped with an A6-4455M (Trinity APU; the GPU part is an HD 7500G). I measure the kernel time with events, using the OpenCL API clGetEventProfilingInfo. However, the same code shows a huge performance difference: the kernel measured under Kubuntu runs around two times faster than the one measured under Windows 7 + MinGW64.
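For reference, the values that clGetEventProfilingInfo returns for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END are device time counters in nanoseconds, so the kernel time is their difference. A minimal sketch of that arithmetic in Python, and of the ratio between two runs. The function names and the sample timestamps are hypothetical and are not measurements from this thread:

```python
def kernel_time_ms(start_ns: int, end_ns: int) -> float:
    """Kernel duration in milliseconds from two profiling timestamps (ns)."""
    if end_ns < start_ns:
        raise ValueError("end timestamp precedes start timestamp")
    return (end_ns - start_ns) / 1e6


def slowdown(time_a_ms: float, time_b_ms: float) -> float:
    """How many times slower run B is than run A."""
    return time_b_ms / time_a_ms


# Hypothetical timestamps, chosen only to illustrate a 2x gap.
linux_ms = kernel_time_ms(1_000_000, 6_000_000)
windows_ms = kernel_time_ms(2_000_000, 12_000_000)
print(f"Linux {linux_ms:.1f} ms, Windows {windows_ms:.1f} ms, "
      f"ratio {slowdown(linux_ms, windows_ms):.1f}x")
```

Comparing START to END only covers execution on the device; queueing and submission delays are excluded, which matches the "kernel time" being discussed here.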

What I want to do is eliminate the memory transfer between CPU and GPU on this integrated chip, namely zero copy. Unfortunately, this feature is only available under Windows, according to Table 4.2 of AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide4.pdf.

So now I face an awkward situation: even if the copy time between CPU and GPU can be reduced to zero, I suffer a two times slower kernel under Windows!

Any ideas why the kernel is so slow under Windows?

Best

11 Replies
nou
Exemplar


Reads and writes to zero-copy memory must go through the PCIe bus, so kernels can be slower.

Biaowang
Adept II

But what I measured is the kernel execution time (memory copy time is excluded).

And if I understand you correctly, zero-copy memory should not go through the PCIe bus any more; otherwise, why is it called zero copy?

Biaowang
Adept II

Gautam's reply:

Windows + MinGW is a simulated Linux environment; I would expect it to be slower than Linux itself. Also, the performance of a sample may depend on the problem size you run it with and on the system configuration you test on. Are you using the same system to run the sample in Kubuntu and Win7 + MinGW? Try an increased problem size; by default, samples run with a small size in order to execute quickly.

Dear Gautam:

I had the same suspicion that this results from the "simulated" environment, so I executed my program in the Windows "cmd" terminal, and the results are still the same. So how can I compare the two executables?

himanshu_gautam
Grandmaster

Kernel execution time includes the time taken to load data from global memory buffers into LDS or registers. Zero copy implies that the GPU uses the same location in RAM that was allocated in host code (via the CPU). RAM is actually divided into regions such as local memory (for the GPU), uncacheable memory, and cacheable memory, and read/write performance varies across these regions. Check http://amddevcentral.com/afds/assets/presentations/1004_final.pdf for details.

himanshu_gautam
Grandmaster

Cygwin is not a simulated environment. It just provides a simple DLL that accepts Unix APIs and possibly uses Windows APIs underneath (after, say, converting file paths, etc.). Apart from that, there is no simulation.

Note: My previous reply on memory hierarchy is specific to APUs. The PDF has more details.

himanshu_gautam
Grandmaster

Can you post the "clinfo" output from the Windows and Linux environments? Look for the clock speed in it. That might give some insight.

Also post the logs related to the performance of the application (for Windows and Linux). Is it your own application or an SDK sample?

Message was edited by: Himanshu Gautam

Biaowang
Adept II


Dear Himanshu:

Though a little late, I have collected the clinfo output under both Windows and Linux. The following is the difference shown by the command "diff A6-4455MLinux_clinfo.txt A6-4455MWin_clinfo.txt":

--------------------------------------------------------------------------------------------------I am the boundary--------------------------------------------------------------------------------------------------

3c3

<   Platform Version:                            OpenCL 1.2 AMD-APP (923.1)

---

>   Platform Version:                            OpenCL 1.2 AMD-APP (1124.2)

6c6

<   Platform Extensions:                                 cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

---

>   Platform Extensions:                                 cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing cl_khr_d3d11_sharing

14d13

<   Device Topology:                             PCI[ B#0, D#1, F#0 ]

35c34

<   Max memory allocation:                       134217728

---

>   Max memory allocation:                       200540160

39,40c38,39

<   Max image 2D width:                          8192

<   Max image 2D height:                                 8192

---

>   Max image 2D width:                          16384

>   Max image 2D height:                                 16384

58c57

<   Global memory size:                          268435456

---

>   Global memory size:                          536870912

76c75

<   Platform ID:                                         0x00007f60fdd1c140

---

>   Platform ID:                                         000007FEEBA62FF8

80c79

<   Driver version:                              CAL 1.4.1741

---

>   Driver version:                              1124.2 (VM)

82,83c81,82

<   Version:                                     OpenCL 1.2 AMD-APP (923.1)

<   Extensions:                                  cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt

---

>   Version:                                     OpenCL 1.2 AMD-APP (1124.2)

>   Extensions:                                  cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing

99,100c98,99

<   Preferred vector width float:                        4

<   Preferred vector width double:               0

---

>   Preferred vector width float:                        8

>   Preferred vector width double:               4

105,107c104,106

<   Native vector width float:                   4

<   Native vector width double:                  0

<   Max clock frequency:                                 1300Mhz

---

>   Native vector width float:                   8

>   Native vector width double:                  4

>   Max clock frequency:                                 2096Mhz

132c131

<   Global memory size:                          7807614976

---

>   Global memory size:                          8014217216

140c139

<   Profiling timer resolution:                  1

---

>   Profiling timer resolution:                  488

150c149

<   Platform ID:                                         0x00007f60fdd1c140

---

>   Platform ID:                                         000007FEEBA62FF8

154c153

<   Driver version:                              2.0 (sse2,avx,fma4)

---

>   Driver version:                              1124.2 (sse2,avx,fma4)

156,157c155,156

<   Version:                                     OpenCL 1.2 AMD-APP (923.1)

<   Extensions:                                  cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt

---

>   Version:                                     OpenCL 1.2 AMD-APP (1124.2)

>   Extensions:                                  cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing

--------------------------------------------------------------------------------------------------I am the boundary--------------------------------------------------------------------------------------------------

I marked the differences in bold and italic. I assume the max clock frequency may be the cause of my problem (no zero copy, just the same code running on the GPU: HD 7500G). However, when I used GPU-Z, the max clock frequency it reported for my GPU was neither 1300 MHz nor 2096 MHz, but 423 MHz instead. Any idea?
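The diff above is easier to compare once the "<" (Linux) and ">" (Windows) lines are paired by field name. A minimal sketch in Python, assuming the normal "diff" output format shown above and that clinfo separates each field name from its value with a colon. The helper names parse_clinfo_diff and to_mib are hypothetical, and the sample is a small excerpt of the diff in this post:

```python
from collections import defaultdict


def parse_clinfo_diff(text):
    """Group '<' (first file) and '>' (second file) diff lines by field name."""
    fields = defaultdict(lambda: {"linux": [], "windows": []})
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] not in "<>" or ":" not in line:
            continue  # skip hunk headers such as '35c34' and '---' separators
        side = "linux" if line[0] == "<" else "windows"
        name, value = line[1:].split(":", 1)
        # A field can appear once per device, so keep every occurrence.
        fields[name.strip()][side].append(value.strip())
    return dict(fields)


def to_mib(size_bytes):
    """Convert a byte count (string or int) to MiB."""
    return int(size_bytes) / (1024 * 1024)


sample = """\
35c34
<   Max memory allocation:                       134217728
---
>   Max memory allocation:                       200540160
58c57
<   Global memory size:                          268435456
---
>   Global memory size:                          536870912
140c139
<   Profiling timer resolution:                  1
---
>   Profiling timer resolution:                  488
"""

diff = parse_clinfo_diff(sample)
for name in ("Max memory allocation", "Global memory size"):
    lin, win = diff[name]["linux"][0], diff[name]["windows"][0]
    print(f"{name}: {to_mib(lin):.2f} MiB (Linux) vs {to_mib(win):.2f} MiB (Windows)")
```

Values are kept as lists because clinfo prints the same field names once per device, so a field such as "Global memory size" shows up more than once in the full diff.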

BTW, what does VM stand for in the driver version?

nou
Exemplar

VM stands for virtual memory and IIRC it has something to do with zero copy support.

himanshu_gautam
Grandmaster

VM refers to the functionality in AMD drivers that translates GPU addresses used in a kernel to memory locations in the host PC's RAM.

If you use ALLOC_HOST_PTR, the runtime will automatically use VM to access memory from kernels, so there is no need to copy it to the GPU. This is usually slow and should be used only when the addresses are accessed only once.
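The "accessed only once" advice can be illustrated with a simple cost model: an explicit copy pays a one-time transfer and then reads device-local memory quickly, while zero copy skips the transfer but pays the slower host-memory path on every pass over the data. This is only a sketch in Python. The bandwidth figures are placeholders chosen for illustration, not measurements of any AMD device, and the function names are hypothetical:

```python
def copy_then_compute_s(nbytes, passes, copy_bw, local_bw):
    """One explicit transfer, then `passes` reads from device-local memory."""
    return nbytes / copy_bw + passes * nbytes / local_bw


def zero_copy_s(nbytes, passes, host_bw):
    """No transfer; every pass reads host memory through the slower path."""
    return passes * nbytes / host_bw


# Placeholder bandwidths in bytes per second (illustrative assumptions only).
COPY_BW = 4e9
LOCAL_BW = 16e9
HOST_BW = 4e9
NBYTES = 64 * 1024 * 1024

for passes in (1, 2, 4):
    a = copy_then_compute_s(NBYTES, passes, COPY_BW, LOCAL_BW)
    b = zero_copy_s(NBYTES, passes, HOST_BW)
    better = "zero copy" if b < a else "explicit copy"
    print(f"{passes} pass(es): copy {a * 1e3:.1f} ms, "
          f"zero copy {b * 1e3:.1f} ms -> {better}")
```

With these placeholder numbers, zero copy only wins when the data is read a single time; as soon as the kernel touches the buffer repeatedly, the one-time copy is amortized. Where the break-even point actually lies depends on the real bandwidths of the memory regions involved.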
