14 Replies Latest reply on Aug 17, 2012 9:54 AM by yurtesen

    Map/Unmap performance of SDK 2.7 with Catalyst 12.6

    yurtesen

      To summarize, I am getting very poor map/unmap performance. The device is a Radeon 7970 in a Bulldozer-based machine (so NOT PCIe 3.0, but PCIe 2.1 x16).

       

      I used both CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR, and ran map/unmap at least twice to make sure that the issue is not the pinning cost. The result: for 2.7GB of data, the transfer takes about 1.8 seconds, which makes roughly 1.5GB/sec transfer speed (which is ridiculously slow).

       

      With a Tesla M2050, for 2.5GB of data the map/unmap takes about 0.5 seconds, which makes about 5GB/sec, the expected speed. Also, the write map copies the data from the card to host memory and back, while the read map correctly copies data only from card to host.
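For anyone checking the two figures above, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the PCIe 2.x link numbers (5 GT/s per lane, 8b/10b encoding, so 0.5 GB/s per lane per direction) are standard link figures rather than anything measured in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Throughput in GB/s (1 GB = 1e9 bytes) for `bytes` moved in `seconds`.
// Hypothetical helper, not part of the attached test program.
double throughput_gbps(std::uint64_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds / 1e9;
}

// Theoretical per-direction peak of a PCIe 2.x link in GB/s.
// Assumption: 5 GT/s per lane with 8b/10b encoding, i.e. 0.5 GB/s per lane.
double pcie2_peak_gbps(int lanes) {
    assert(lanes > 0);
    return 0.5 * lanes;
}
```

With the sizes from the logs, throughput_gbps(2700000000, 1.8) comes out at 1.5 GB/s for the Radeon and throughput_gbps(2520000000, 0.5) at about 5.04 GB/s for the Tesla. Against the 8 GB/s that pcie2_peak_gbps(16) gives, the Radeon figure is under 20% of the theoretical link rate.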

       

      I am not sure whether AMD's implementation does some trick and does not copy from device to host because no kernel was run on the device in my test case (which would be smart); maybe that's why it does not transfer the data in? In either case, AMD loses when map/unmap is used. So what is the reason? A bug in the SDK? The sample program is attached. Can anybody have a look?

       

      So, can you tell

       

      On Radeon 7970

      Size: 225000000 x 4 = 900000000 bytes * 3 = 2700000000
      Tahiti
      WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds
      CL_MAP_WRITE 0
      WALL time for Map #0 =   0.01 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   2.39 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_WRITE 1
      WALL time for Map #1 =   0.00 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   1.76 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_READ 0
      WALL time for Map #0 =   0.00 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.00 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_READ 1
      WALL time for Map #1 =   0.00 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.00 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      Mapping with CL_MAP_WRITE and writing to mapped area
      WALL time for Map #1 =   0.00 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for write mapped area =   1.31 seconds
      WALL time for Unmap =   1.79 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      
      
      Allocating memory using memalign 4096
      WALL time for memalign =   0.00 seconds
      WALL time for CL_MEM_USE_HOST_PTR =   0.18 seconds
      CL_MAP_WRITE 0
      WALL time for Map #0 =   0.01 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   2.18 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_WRITE 1
      WALL time for Map #1 =   0.00 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   1.74 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_READ 0
      WALL time for Map #0 =   0.00 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.00 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_READ 1
      WALL time for Map #1 =   0.00 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.00 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      Mapping with CL_MAP_WRITE and writing to mapped area
      WALL time for Map #1 =   0.00 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for write mapped area =   1.33 seconds
      WALL time for Unmap =   1.78 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      

       

      With Tesla M2050

      Size: 210000000 x 4 = 840000000 bytes * 3 = 2520000000
      Tesla M2050
      WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds
      CL_MAP_WRITE 0
      WALL time for Map #0 =   0.53 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.53 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_WRITE 1
      WALL time for Map #1 =   0.87 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.53 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_READ 0
      WALL time for Map #0 =   0.83 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.10 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_READ 1
      WALL time for Map #1 =   0.82 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.10 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      Mapping with CL_MAP_WRITE and writing to mapped area
      WALL time for Map #1 =   0.82 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for write mapped area =   0.54 seconds
      WALL time for Unmap =   0.57 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      
      
      Allocating memory using memalign 4096
      WALL time for memalign =   0.00 seconds
      WALL time for CL_MEM_USE_HOST_PTR =   0.59 seconds
      CL_MAP_WRITE 0
      WALL time for Map #0 =   0.51 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.48 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_WRITE 1
      WALL time for Map #1 =   0.56 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.48 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_READ 0
      WALL time for Map #0 =   0.56 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.01 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      CL_MAP_READ 1
      WALL time for Map #1 =   0.56 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for Unmap =   0.00 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      Mapping with CL_MAP_WRITE and writing to mapped area
      WALL time for Map #1 =   0.56 seconds
      Map #events: 3 time: 0.0000 seconds.
      WALL time for write mapped area =   0.32 seconds
      WALL time for Unmap =   0.48 seconds
      Unmap #events: 3 time: 0.0000 seconds.
      
        • Map/Unmap performance of SDK 2.7 with Catalyst 12.6
          nou

          Did you try running the SDK BufferBandwidth sample? It measures Map/Unmap speed.

            • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
              yurtesen

              nou wrote:

               

              Did you try running the SDK BufferBandwidth sample? It measures Map/Unmap speed.

              Yes, BufferBandwidth gives respectable results, between 5 and 6 GB/s. However, I am not sure exactly why my program is slower. The only differences I can see are that I used blocking mapping and that I have 3 objects mapped back to back. I will perhaps reduce it to 1 object and try again. However, it seems to work fine on Nvidia cards. Do you think there is a problem in my code?

            • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
              vanja_z

              Hi yurtesen,

               

              Welcome to the convoluted, implementation (un)defined world of OpenCL memory (mis)management.

               

              I have run your code for a range of buffer sizes on my machine (Arch Linux, Catalyst 12.6, HD6950 2GB) and found that the performance varies with buffer size. I am unable to test larger buffers since my device is unable to use more than around 60% of its memory (http://devgurus.amd.com/thread/158397).

               

              Total size (MB)   CL_MAP_WRITE 1 time (s)   Speed (GB/s)
              180               0.00                      inf
              360               0.07                      5.14
              540               0.20                      2.70
              720               0.28                      2.57
              900               0.55                      1.64
              1080              0.66                      1.64
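The speed column above is just size divided by time. The sketch below shows one way it could be computed, including the "inf" row, under the assumption that the times were read from output printed with two decimals, so anything faster than the timer resolution shows up as 0.00. The helper name is hypothetical.

```cpp
#include <cassert>
#include <limits>

// Speed in GB/s for a transfer of `size_mb` megabytes (1 GB = 1000 MB here)
// that took `seconds`, where `seconds` is a reading rounded to 0.01 s.
double speed_gbps(double size_mb, double seconds) {
    assert(size_mb >= 0.0 && seconds >= 0.0);
    if (seconds == 0.0) {
        // Too fast for the printed resolution: reported as "inf" in the table.
        return std::numeric_limits<double>::infinity();
    }
    return size_mb / 1000.0 / seconds;
}
```

For example, speed_gbps(360, 0.07) gives about 5.14 and speed_gbps(1080, 0.66) about 1.64, matching the table. The 180 MB row only says the unmap finished in under the timer resolution, not that the transfer was infinitely fast.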

               

              I haven't had the chance to look at your code in detail, but my experiments with clEnqueue(Read/Write)Buffer have shown that consistent speeds of around 6 GB/s are possible all the way up to the largest buffers possible on the device (60% of rated memory size). You may also want to take a look at the SDK BufferBandwidth sample as nou suggested.

               

              Good luck,

               

              Vanja

                • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                  yurtesen

                  Vanja, did you try

                  export GPU_MAX_ALLOC_PERCENT=100

                  With that I am able to use all the memory on the device, though it is perhaps dangerous to set it to 100%.
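If exporting the variable in the shell is inconvenient, a program could set it for its own process instead. This is only a sketch: the variable is spelled GPU_MAX_ALLOC_PERCENT in AMD's runtime, the helper name is made up, and it assumes (not confirmed in this thread) that the runtime reads the variable when it initializes, so the call would have to come before the first OpenCL call. setenv is POSIX, so this does not apply to Windows.

```cpp
#include <cassert>
#include <cstdlib>   // std::getenv; setenv is declared here on POSIX systems
#include <string>

// Set GPU_MAX_ALLOC_PERCENT for the current process (POSIX only).
// Returns true on success. `percent` must be in 1..100.
// Hypothetical helper; assumed to run before any OpenCL call is made.
bool set_gpu_max_alloc_percent(int percent) {
    assert(percent >= 1 && percent <= 100);
    const std::string value = std::to_string(percent);
    return setenv("GPU_MAX_ALLOC_PERCENT", value.c_str(), /*overwrite=*/1) == 0;
}
```

Calling set_gpu_max_alloc_percent(100) at the top of main() would be the in-process counterpart of the export line above; a lower value is probably the safer choice.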

                   

                  I will check my code and try to reduce the map/unmap operations. Right now it does 3 maps back to back; I wonder if that has something to do with it, because the bundled BufferBandwidth program seems to work pretty nicely.

                    • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                      yurtesen

                      Well, I am not able to find out why my code runs so much slower. I have reduced the number of consecutive maps and unmaps to a single map/unmap and it is still slow.

                       

                      I have now found that it also does something strange if the size is over 900MB for a single buffer object (an updated single map / single unmap test is attached). The new code still works fine on Nvidia devices with buffer objects over 900MB... I don't quite understand why it doesn't like the AMD OpenCL SDK.

                       

                      I would appreciate any feedback.

                       

                      nvidia

                      $ ./test_objalloc

                      Size: 225000000 x 4 = 900000000 bytesmax sizet 18446744073709551615

                      Tesla M2050

                      WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds

                      CL_MAP_WRITE 0

                      WALL time for Map #0 =   0.19 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.20 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_WRITE 1

                      WALL time for Map #1 =   0.32 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.19 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_READ 0

                      WALL time for Map #0 =   0.32 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.05 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_READ 1

                      WALL time for Map #1 =   0.32 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.05 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      Mapping with CL_MAP_WRITE and writing to mapped area

                      WALL time for Map #1 =   0.32 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for write mapped area =   0.36 seconds

                      WALL time for Unmap =   0.20 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                       

                       

                      Allocating memory using memalign 4096

                      WALL time for memalign =   0.00 seconds

                      WALL time for CL_MEM_USE_HOST_PTR =   0.21 seconds

                      CL_MAP_WRITE 0

                      WALL time for Map #0 =   0.16 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.15 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_WRITE 1

                      WALL time for Map #1 =   0.14 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.15 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_READ 0

                      WALL time for Map #0 =   0.14 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.00 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_READ 1

                      WALL time for Map #1 =   0.14 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.00 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      Mapping with CL_MAP_WRITE and writing to mapped area

                      WALL time for Map #1 =   0.14 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for write mapped area =   0.22 seconds

                      WALL time for Unmap =   0.15 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      amd

                      $ ./test_objalloc

                      Size: 225000000 x 4 = 900000000 bytesmax sizet 18446744073709551615

                      Tahiti

                      WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds

                      CL_MAP_WRITE 0

                      WALL time for Map #0 =   0.00 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.95 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_WRITE 1

                      WALL time for Map #1 =   0.00 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.57 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_READ 0

                      WALL time for Map #0 =   0.00 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.00 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_READ 1

                      WALL time for Map #1 =   0.00 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.00 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      Mapping with CL_MAP_WRITE and writing to mapped area

                      WALL time for Map #1 =   0.00 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for write mapped area =   0.54 seconds

                      WALL time for Unmap =   0.58 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                       

                       

                       

                      Allocating memory using memalign 4096

                      WALL time for memalign =   0.00 seconds

                      WALL time for CL_MEM_USE_HOST_PTR =   0.07 seconds

                      CL_MAP_WRITE 0

                      WALL time for Map #0 =   0.00 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.71 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_WRITE 1

                      WALL time for Map #1 =   0.00 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.58 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_READ 0

                      WALL time for Map #0 =   0.00 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.00 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      CL_MAP_READ 1

                      WALL time for Map #1 =   0.00 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for Unmap =   0.00 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                      Mapping with CL_MAP_WRITE and writing to mapped area

                      WALL time for Map #1 =   0.00 seconds

                      Map #events: 1 time: 0.0000 seconds.

                      WALL time for write mapped area =   0.54 seconds

                      WALL time for Unmap =   0.58 seconds

                      Unmap #events: 1 time: 0.0000 seconds.

                        • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                          Wenju

                          Hi yurtesen,

                          I can't compile your code successfully. I need more time to check it.

                            • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                              yurtesen

                              It should compile successfully using the libraries/includes provided in the file (at least one other person could compile it here). Also, I am able to compile it easily on Fedora 16, Fedora 10, Ubuntu 11.x, Ubuntu 12.x and Scientific Linux 6 (SL6). Can you tell me what problem you are having?

                                • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                                  Wenju

                                  test_objalloc.cpp: In function ‘void mapwrite()’:

                                  test_objalloc.cpp:207:17: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

                                  test_objalloc.cpp:205:24: warning: ‘i’ is used uninitialized in this function [-Wuninitialized]

                                  /tmp/cc8O7n4J.o: In function `cl_checkelapsedtime(std::vector<cl::Event, std::allocator<cl::Event> >, char const*, int)':

                                  test_objalloc.cpp:(.text+0xb6): undefined reference to `std::cout'

                                  test_objalloc.cpp:(.text+0xbb): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)'

                                  test_objalloc.cpp:(.text+0xca): undefined reference to `std::cout'

                                  test_objalloc.cpp:(.text+0xcf): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)'

                                  test_objalloc.cpp:(.text+0xd7): undefined reference to `std::cout'

                                  test_objalloc.cpp:(.text+0xdf): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<unsigned long>(unsigned long)'

                                  test_objalloc.cpp:(.text+0xf4): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)'

                                  test_objalloc.cpp:(.text+0x10e): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)'

                                  test_objalloc.cpp:(.text+0x123): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)'

                                  test_objalloc.cpp:(.text+0x150): undefined reference to `std::basic_ostream<char, std::char_traits<char> >::put(char)'

                                  test_objalloc.cpp:(.text+0x158): undefined reference to `std::basic_ostream<char, std::char_traits<char> >::flush()'

                                  test_objalloc.cpp:(.text+0x174): undefined reference to `std::ctype<char>::_M_widen_init() const'

                                  test_objalloc.cpp:(.text+0x1b3): undefined reference to `std::cout'

                                  test_objalloc.cpp:(.text+0x1be): undefined reference to `std::cout'

                                  test_objalloc.cpp:(.text+0x1c9): undefined reference to `std::basic_ios<char, std::char_traits<char> >::clear(std::_Ios_Iostate)'

                                  test_objalloc.cpp:(.text+0x1d3): undefined reference to `std::__throw_bad_cast()'

                                  /tmp/cc8O7n4J.o: In function `checkErr(int, char const*)':

                                  test_objalloc.cpp:(.text+0x24b): undefined reference to `std::cerr'

                                  test_objalloc.cpp:(.text+0x250): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

                                  test_objalloc.cpp:(.text+0x25b): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

                                  test_objalloc.cpp:(.text+0x268): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

                                  test_objalloc.cpp:(.text+0x272): undefined reference to `std::basic_ostream<char, std::char_traits<char> >::operator<<(int)'

                                  test_objalloc.cpp:(.text+0x27f): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

                                  test_objalloc.cpp:(.text+0x28c): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

                                  test_objalloc.cpp:(.text+0x297): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

                                  test_objalloc.cpp:(.text+0x2a4): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

                                  test_objalloc.cpp:(.text+0x2ac): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::endl<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&)'

                                  /tmp/cc8O7n4J.o: In function `map(int, unsigned long)':

                                  test_objalloc.cpp:(.text+0x483): undefined reference to `operator new(unsigned long)'

                                  test_objalloc.cpp:(.text+0x524): undefined reference to `operator delete(void*)'

                                  test_objalloc.cpp:(.text+0x594): undefined reference to `operator delete(void*)'

                                  test_objalloc.cpp:(.text+0x5de): undefined reference to `operator delete(void*)'

                                  test_objalloc.cpp:(.text+0x600): undefined reference to `operator delete(void*)'

                                  test_objalloc.cpp:(.text+0x61e): undefined reference to `std::__throw_bad_alloc()'

                                  test_objalloc.cpp:(.text+0x635): undefined reference to `__cxa_begin_catch'

                                  test_objalloc.cpp:(.text+0x650): undefined reference to `__cxa_rethrow'

                                  test_objalloc.cpp:(.text+0x658): undefined reference to `__cxa_end_catch'

                                  /tmp/cc8O7n4J.o: In function `unmap()':

                                  test_objalloc.cpp:(.text+0x7f4): undefined reference to `operator new(unsigned long)'

                                  test_objalloc.cpp:(.text+0x894): undefined reference to `operator delete(void*)'

                                  test_objalloc.cpp:(.text+0x903): undefined reference to `operator delete(void*)'

                                  test_objalloc.cpp:(.text+0x943): undefined reference to `__cxa_begin_catch'

                                  test_objalloc.cpp:(.text+0x95e): undefined reference to `__cxa_rethrow'

                                  test_objalloc.cpp:(.text+0x968): undefined reference to `std::__throw_bad_alloc()'

                                  test_objalloc.cpp:(.text+0x970): undefined reference to `__cxa_end_catch'

                                  test_objalloc.cpp:(.text+0x97f): undefined reference to `operator delete(void*)'

                                  test_objalloc.cpp:(.text+0x9bc): undefined reference to `operator delete(void*)'

                                  /tmp/cc8O7n4J.o: In function `std::vector<cl::Device, std::allocator<cl::Device> >::~vector()':

                                  test_objalloc.cpp:(.text._ZNSt6vectorIN2cl6DeviceESaIS1_EED2Ev[_ZNSt6vectorIN2cl6DeviceESaIS1_EED5Ev]+0x5a): undefined reference to `operator delete(void*)'

                                  test_objalloc.cpp:(.text._ZNSt6vectorIN2cl6DeviceESaIS1_EED2Ev[_ZNSt6vectorIN2cl6DeviceESaIS1_EED5Ev]+0x3f): undefined reference to `operator delete(void*)'

                                  /tmp/cc8O7n4J.o: In function `std::vector<cl::Event, std::allocator<cl::Event> >::~vector()':

                                  test_objalloc.cpp:(.text._ZNSt6vectorIN2cl5EventESaIS1_EED2Ev[_ZNSt6vectorIN2cl5EventESaIS1_EED5Ev]+0x5a): undefined reference to `operator delete(void*)'

                                  /tmp/cc8O7n4J.o:test_objalloc.cpp:(.text._ZNSt6vectorIN2cl5EventESaIS1_EED2Ev[_ZNSt6vectorIN2cl5EventESaIS1_EED5Ev]+0x3f): more undefined references to `operator delete(void*)' follow

                                  /tmp/cc8O7n4J.o: In function `std::vector<cl::Event, std::allocator<cl::Event> >::_M_insert_aux(__gnu_cxx::__normal_iterator<cl::Event*, std::vector<cl::Event, std::allocator<cl::Event> > >, cl::Event const&)':

                                  test_objalloc.cpp:(.text._ZNSt6vectorIN2cl5EventESaIS1_EE13_M_insert_auxEN9__gnu_cxx17__normal_iteratorIPS1_S3_EERKS1_[std::vector<cl::Event, std::allocator<cl::Event> >::_M_insert_aux(__gnu_cxx::__normal_iterator<cl::Event*, std::vector<cl::Event, std::allocator<cl::Event> > >, cl::Event const&)]+0x17e): undefined reference to `operator new(unsigned long)'

                                  test_objalloc.cpp:(.text._ZNSt6vectorIN2cl5EventESaIS1_EE13_M_insert_auxEN9__gnu_cxx17__normal_iteratorIPS1_S3_EERKS1_[std::vector<cl::Event, std::allocator<cl::Event> >::_M_insert_aux(__gnu_cxx::__normal_iterator<cl::Event*, std::vector<cl::Event, std::allocator<cl::Event> > >, cl::Event const&)]+0x244): undefined reference to `operator delete(void*)'

                                  test_objalloc.cpp:(.text._ZNSt6vectorIN2cl5EventESaIS1_EE13_M_insert_auxEN9__gnu_cxx17__normal_iteratorIPS1_S3_EERKS1_[std::vector<cl::Event, std..

                                  Any ideas?

                                    • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                                      yurtesen

                                      Can you tell me the command line you are giving to the compiler? Are you sure that you are using a C++ compiler? Which compiler are you using, and which version? Did you extract all the files into the same directory?

                                       

                                      Can you try:

                                      g++ -O3 -g -I./include test_objalloc.cpp -L/opt/AMDAPP/lib -L./lib -lOpenCL -o test_objalloc

                                        • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                                          Wenju

                                          Sorry, I'm just so stupid! Anyway, I got results from an HD6970, but I reduced the size.

                                           

                                          Size: 22500000 x 4 = 90000000 bytesmax sizet 18446744073709551615
                                          Cypress
                                          WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds
                                          CL_MAP_WRITE 0
                                          WALL time for Map #0 =   0.04 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          CL_MAP_WRITE 1
                                          WALL time for Map #1 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          CL_MAP_WRITE 2
                                          WALL time for Map #2 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          CL_MAP_READ 0
                                          WALL time for Map #0 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          CL_MAP_READ 1
                                          WALL time for Map #1 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          CL_MAP_READ 2
                                          WALL time for Map #2 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          Mapping with CL_MAP_WRITE and writing to mapped area
                                          WALL time for Map #1 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for write mapped area =   0.03 seconds
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.

                                           

                                          Allocating memory using memalign 4096
                                          WALL time for memalign =   0.00 seconds
                                          WALL time for CL_MEM_USE_HOST_PTR =   0.03 seconds
                                          CL_MAP_WRITE 0
                                          WALL time for Map #0 =   0.03 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          CL_MAP_WRITE 1
                                          WALL time for Map #1 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          CL_MAP_WRITE 2
                                          WALL time for Map #2 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          CL_MAP_READ 0
                                          WALL time for Map #0 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          CL_MAP_READ 1
                                          WALL time for Map #1 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          CL_MAP_READ 2
                                          WALL time for Map #2 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.
                                          Mapping with CL_MAP_WRITE and writing to mapped area
                                          WALL time for Map #1 =   0.00 seconds
                                          Map #events: 1 time: 0.0000 seconds.
                                          WALL time for write mapped area =   0.03 seconds
                                          WALL time for Unmap =   0.00 seconds
                                          Unmap #events: 1 time: 0.0000 seconds.

                                            • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                                              yurtesen

                                              Wenju, those results are wrong. Try the first program (from the first post) with 3 objects. I think there is a second problem that prevents allocating a large amount of memory in a single buffer (it appears to do nothing when the size is too large).

                                                • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                                                  Wenju

                                                  Max memory allocation:  536870912, so I used the max size. So nothing goes wrong when you use 15000*15000? It failed for me.
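For reference, the size arithmetic behind this can be sketched as below. It is only an illustration: it assumes that 15000*15000 means 225000000 four-byte elements per buffer (as in the output of the first post) and that the 536870912 figure is the device's reported CL_DEVICE_MAX_MEM_ALLOC_SIZE. The helper name `fits_in_single_buffer` is made up for this example and is not part of the attached program.

```c
#include <stdint.h>

/* Hypothetical helper (not from the attached program): returns 1 if a
   buffer of `elements` items of `elem_size` bytes each fits within the
   device's reported maximum single allocation, 0 otherwise. */
int fits_in_single_buffer(uint64_t elements, uint64_t elem_size,
                          uint64_t max_alloc_bytes)
{
    return elements * elem_size <= max_alloc_bytes;
}
```

With these numbers, 134217728 x 4 = 536870912 bytes fits exactly, while 15000*15000 x 4 = 900000000 bytes is larger than the reported limit.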

                                                  This time I tested it on a 7970.

                                                  Size: 134217728 x 4 = 536870912 bytes * 3 = 1610612736
                                                  Tahiti
                                                  WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds
                                                  CL_MAP_WRITE 0
                                                  WALL time for Map #0 =   0.05 seconds
                                                  Map #events: 3 time: 0.0000 seconds.
                                                  WALL time for Unmap =   1.17 seconds
                                                  Unmap #events: 3 time: 0.0000 seconds.
                                                  CL_MAP_WRITE 1
                                                  WALL time for Map #1 =   0.00 seconds
                                                  Map #events: 3 time: 0.0000 seconds.
                                                  WALL time for Unmap =   0.70 seconds
                                                  Unmap #events: 3 time: 0.0000 seconds.
                                                  CL_MAP_READ 0
                                                  WALL time for Map #0 =   0.00 seconds
                                                  Map #events: 3 time: 0.0000 seconds.
                                                  WALL time for Unmap =   0.00 seconds
                                                  Unmap #events: 3 time: 0.0000 seconds.
                                                  CL_MAP_READ 1
                                                  WALL time for Map #1 =   0.00 seconds
                                                  Map #events: 3 time: 0.0000 seconds.
                                                  WALL time for Unmap =   0.00 seconds
                                                  Unmap #events: 3 time: 0.0000 seconds.
                                                  Mapping with CL_MAP_WRITE and writing to mapped area
                                                  WALL time for Map #1 =   0.00 seconds
                                                  Map #events: 3 time: 0.0000 seconds.
                                                  WALL time for write mapped area =   0.54 seconds
                                                  WALL time for Unmap =   0.77 seconds
                                                  Unmap #events: 3 time: 0.0000 seconds.

                                                   

                                                  Allocating memory using memalign 4096
                                                  WALL time for memalign =   0.00 seconds
                                                  WALL time for CL_MEM_USE_HOST_PTR =   0.09 seconds
                                                  CL_MAP_WRITE 0
                                                  WALL time for Map #0 =   0.01 seconds
                                                  Map #events: 3 time: 0.0000 seconds.
                                                  WALL time for Unmap =   0.97 seconds
                                                  Unmap #events: 3 time: 0.0000 seconds.
                                                  CL_MAP_WRITE 1
                                                  WALL time for Map #1 =   0.00 seconds
                                                  Map #events: 3 time: 0.0000 seconds.
                                                  WALL time for Unmap =   0.69 seconds
                                                  Unmap #events: 3 time: 0.0000 seconds.
                                                  CL_MAP_READ 0
                                                  WALL time for Map #0 =   0.00 seconds
                                                  Map #events: 3 time: 0.0000 seconds.
                                                  WALL time for Unmap =   0.00 seconds
                                                  Unmap #events: 3 time: 0.0000 seconds.
                                                  CL_MAP_READ 1
                                                  WALL time for Map #1 =   0.00 seconds
                                                  Map #events: 3 time: 0.0000 seconds.
                                                  WALL time for Unmap =   0.00 seconds
                                                  Unmap #events: 3 time: 0.0000 seconds.
                                                  Mapping with CL_MAP_WRITE and writing to mapped area
                                                  WALL time for Map #1 =   0.00 seconds
                                                  Map #events: 3 time: 0.0000 seconds.
                                                  WALL time for write mapped area =   0.54 seconds
                                                  WALL time for Unmap =   0.77 seconds
                                                  Unmap #events: 3 time: 0.0000 seconds.

                                                    • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                                                      Wenju

                                                      I am not sure if AMD's implementation does some trick and do not copy from device to host because there was no kernel ran in the device in my test case (which would be smart), maybe thats why it does not transfer it in?

                                                       

                                                      Yes. If the memory object is created with CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR, the pointer that clEnqueueMapBuffer/clEnqueueMapImage returns will always be the same. Before any data transfer, the runtime checks whether that pointer is a new one; if it is, it does not transfer the data.

                                                        • Re: Map/Unmap performance of SDK 2.7 with Catalyst 12.6
                                                          yurtesen

                                                          If you map for reading, the implementation copies the data from device to host. I am not sure exactly what is wrong with the 'single object' version that I made. The attached program should work on Cypress, and my results on Cypress look like this (about 1.2 GB/sec):

                                                          Size: 60000000 x 4 = 240000000 bytes * 3 = 720000000
                                                          Cypress
                                                          WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds
                                                          CL_MAP_WRITE 0
                                                          WALL time for Map #0 =   0.14 seconds
                                                          Map #events: 3 time: 0.0000 seconds.
                                                          WALL time for Unmap =   0.48 seconds
                                                          Unmap #events: 3 time: 0.0000 seconds.
                                                          CL_MAP_WRITE 1
                                                          WALL time for Map #1 =   0.00 seconds
                                                          Map #events: 3 time: 0.0000 seconds.
                                                          WALL time for Unmap =   0.35 seconds
                                                          Unmap #events: 3 time: 0.0000 seconds.
                                                          ...
                                                          ...

                                                           

                                                          Anyway, the point was why the transfers are so slow...
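To make the GB/sec figures in this thread easy to reproduce, here is a minimal sketch of the arithmetic. It assumes the whole "bytes * 3" total is moved once during the timed step and that 1 GB means 1e9 bytes. The function name `throughput_gb_per_s` is invented for illustration and does not appear in the attached program.

```c
#include <stdint.h>

/* Hypothetical helper: effective throughput in GB/s (1 GB = 1e9 bytes)
   for `bytes` moved in `seconds` of wall time. Returns 0 for a
   non-positive time, because the logs print two decimals and very
   short steps show up as 0.00. */
double throughput_gb_per_s(uint64_t bytes, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return (double)bytes / seconds / 1e9;
}
```

For the Cypress run above, 720000000 bytes over the first map plus unmap (0.14 + 0.48 = 0.62 seconds) comes to about 1.16 GB/s. For the Tahiti run with 1610612736 bytes and a 0.70 second unmap it comes to about 2.3 GB/s.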