yurtesen
Miniboss

Map/Unmap performance of SDK 2.7 with Catalyst 12.6

To summarize, I am getting very poor map/unmap performance. The device is a Radeon 7970 running in a Bulldozer-based machine (so NOT PCIe 3.0, but PCIe 2.1 x16).

I used both CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR, and ran map/unmap at least twice to make sure that the issue is not the pinning cost. The result: for 2.7 GB of data, the transfer takes about 1.8 seconds, which works out to roughly 1.5 GB/s (which is ridiculously slow).

With a Tesla M2050 and 2.5 GB of data, the map/unmap takes about 0.5 seconds, which works out to about 5 GB/s, the expected speed. Also, the write map copies the data from the card to host memory and back, while the read map correctly copies data only from the card to the host.
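For reference, the throughput figures quoted above are just bytes moved divided by wall time. A minimal sketch of that arithmetic follows; the function name is made up for illustration and is not part of the attached test program, and 1 GB is taken as 10^9 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from test_objalloc.cpp): effective transfer rate
// in GB/s, with 1 GB = 1e9 bytes, from bytes moved and wall-clock seconds.
double throughput_gb_per_s(std::uint64_t bytes, double seconds) {
    assert(seconds > 0.0);  // a wall time of 0.00 s would give a meaningless "inf"
    return static_cast<double>(bytes) / seconds / 1e9;
}
```

With the numbers from this post, throughput_gb_per_s(2700000000, 1.8) gives about 1.5 for the 7970, and throughput_gb_per_s(2520000000, 0.5) gives about 5 for the M2050.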

I am not sure whether AMD's implementation uses some trick and skips the device-to-host copy because no kernel was run on the device in my test case (which would be smart); maybe that is why it does not transfer the data in? In either case, AMD loses when map/unmap is used. So what is the reason? A bug in the SDK? The sample program is attached. Can anybody have a look?

So, can you tell

On Radeon 7970

Size: 225000000 x 4 = 900000000 bytes * 3 = 2700000000

Tahiti

WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds

CL_MAP_WRITE 0

WALL time for Map #0 =   0.01 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   2.39 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_WRITE 1

WALL time for Map #1 =   0.00 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   1.76 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_READ 0

WALL time for Map #0 =   0.00 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_READ 1

WALL time for Map #1 =   0.00 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 3 time: 0.0000 seconds.

Mapping with CL_MAP_WRITE and writing to mapped area

WALL time for Map #1 =   0.00 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for write mapped area =   1.31 seconds

WALL time for Unmap =   1.79 seconds

Unmap #events: 3 time: 0.0000 seconds.

Allocating memory using memalign 4096

WALL time for memalign =   0.00 seconds

WALL time for CL_MEM_USE_HOST_PTR =   0.18 seconds

CL_MAP_WRITE 0

WALL time for Map #0 =   0.01 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   2.18 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_WRITE 1

WALL time for Map #1 =   0.00 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   1.74 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_READ 0

WALL time for Map #0 =   0.00 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_READ 1

WALL time for Map #1 =   0.00 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 3 time: 0.0000 seconds.

Mapping with CL_MAP_WRITE and writing to mapped area

WALL time for Map #1 =   0.00 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for write mapped area =   1.33 seconds

WALL time for Unmap =   1.78 seconds

Unmap #events: 3 time: 0.0000 seconds.

With Tesla M2050

Size: 210000000 x 4 = 840000000 bytes * 3 = 2520000000

Tesla M2050

WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds

CL_MAP_WRITE 0

WALL time for Map #0 =   0.53 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.53 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_WRITE 1

WALL time for Map #1 =   0.87 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.53 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_READ 0

WALL time for Map #0 =   0.83 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.10 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_READ 1

WALL time for Map #1 =   0.82 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.10 seconds

Unmap #events: 3 time: 0.0000 seconds.

Mapping with CL_MAP_WRITE and writing to mapped area

WALL time for Map #1 =   0.82 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for write mapped area =   0.54 seconds

WALL time for Unmap =   0.57 seconds

Unmap #events: 3 time: 0.0000 seconds.

Allocating memory using memalign 4096

WALL time for memalign =   0.00 seconds

WALL time for CL_MEM_USE_HOST_PTR =   0.59 seconds

CL_MAP_WRITE 0

WALL time for Map #0 =   0.51 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.48 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_WRITE 1

WALL time for Map #1 =   0.56 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.48 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_READ 0

WALL time for Map #0 =   0.56 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.01 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_READ 1

WALL time for Map #1 =   0.56 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 3 time: 0.0000 seconds.

Mapping with CL_MAP_WRITE and writing to mapped area

WALL time for Map #1 =   0.56 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for write mapped area =   0.32 seconds

WALL time for Unmap =   0.48 seconds

Unmap #events: 3 time: 0.0000 seconds.

0 Likes
14 Replies
nou
Exemplar

Did you try running the SDK BufferBandwidth sample? It measures map/unmap speed.

0 Likes

nou wrote:

Did you try running the SDK BufferBandwidth sample? It measures map/unmap speed.

Yes, BufferBandwidth gives respectable results of between 5 and 6 GB/s. However, I am not sure exactly why my program is slower. The only differences I can see are that I used blocking mapping and that I have 3 objects mapped back to back. I will perhaps reduce it to 1 object and try again. However, it seems to work fine on Nvidia cards. Do you think there is a problem in my code?

0 Likes
vanja_z
Adept II

Hi yurtesen,

Welcome to the convoluted, implementation (un)defined world of OpenCL memory (mis)management.

I have run your code for a range of buffer sizes on my machine (Arch Linux, Catalyst 12.6, HD6950 2GB) and found that the performance varies with buffer size. I am unable to test larger buffers since my device is unable to use more than around 60% of its memory (http://devgurus.amd.com/thread/158397).

Total size (MB)    CL_MAP_WRITE 1 time (s)    Speed (GB/s)
180                0.00                       inf
360                0.07                       5.14
540                0.20                       2.70
720                0.28                       2.57
900                0.55                       1.64
1080               0.66                       1.64

I haven't had the chance to look at your code in detail, but my experiments with clEnqueue(Read/Write)Buffer have shown that consistent speeds of around 6 GB/s are possible all the way up to the largest buffers the device allows (60% of rated memory size). You may also want to take a look at the SDK BufferBandwidth sample, as nou suggested.

Good luck,

Vanja

0 Likes

Vanja, did you try

export GPU_MAX_ALLOC_PERCENT=100

With that I am able to use all the memory in the device, although it is perhaps dangerous to set it to 100%.
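When comparing results between machines it can help to record what this setting was for each run. Below is a small sketch that reads the variable from the environment. It assumes the spelling GPU_MAX_ALLOC_PERCENT; the helper names and the fallback value are invented for illustration, and the fallback is not claimed to be the runtime's actual default.

```cpp
#include <cstdlib>

// Hypothetical helper: parse a percentage string, returning `fallback`
// when the value is missing, not a whole number, or outside 0..100.
int parse_alloc_percent(const char* value, int fallback) {
    if (value == nullptr || *value == '\0') return fallback;
    char* end = nullptr;
    long v = std::strtol(value, &end, 10);
    if (*end != '\0' || v < 0 || v > 100) return fallback;
    return static_cast<int>(v);
}

// Reads the setting from the environment of the current process.
// The variable name is an assumption about the AMD runtime.
int alloc_percent_from_env(int fallback) {
    return parse_alloc_percent(std::getenv("GPU_MAX_ALLOC_PERCENT"), fallback);
}
```

Printing alloc_percent_from_env(-1) at the start of a benchmark run would show -1 when the variable is unset, which makes it easier to tell later whether two sets of numbers were taken under the same allocation limit.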

I will check my code and try to reduce the number of map/unmap operations. Right now it does 3 maps back to back; I wonder if it has something to do with that, because the bundled BufferBandwidth program seems to work pretty nicely.

0 Likes

Well, I am not able to find out why my code is so much slower. I have reduced the number of consecutive maps and unmaps to a single map/unmap and it is still slow.

I have now found that it also does something strange if the size is over 900 MB for a single buffer object (the updated single map / single unmap test is attached). The new code still works fine on Nvidia devices with a buffer object over 900 MB... I don't quite understand why it doesn't get along with the AMD OpenCL SDK.

I would appreciate any feedback.

nvidia

$ ./test_objalloc

Size: 225000000 x 4 = 900000000 bytesmax sizet 18446744073709551615

Tesla M2050

WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds

CL_MAP_WRITE 0

WALL time for Map #0 =   0.19 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.20 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_WRITE 1

WALL time for Map #1 =   0.32 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.19 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_READ 0

WALL time for Map #0 =   0.32 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.05 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_READ 1

WALL time for Map #1 =   0.32 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.05 seconds

Unmap #events: 1 time: 0.0000 seconds.

Mapping with CL_MAP_WRITE and writing to mapped area

WALL time for Map #1 =   0.32 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for write mapped area =   0.36 seconds

WALL time for Unmap =   0.20 seconds

Unmap #events: 1 time: 0.0000 seconds.

Allocating memory using memalign 4096

WALL time for memalign =   0.00 seconds

WALL time for CL_MEM_USE_HOST_PTR =   0.21 seconds

CL_MAP_WRITE 0

WALL time for Map #0 =   0.16 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.15 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_WRITE 1

WALL time for Map #1 =   0.14 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.15 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_READ 0

WALL time for Map #0 =   0.14 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_READ 1

WALL time for Map #1 =   0.14 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 1 time: 0.0000 seconds.

Mapping with CL_MAP_WRITE and writing to mapped area

WALL time for Map #1 =   0.14 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for write mapped area =   0.22 seconds

WALL time for Unmap =   0.15 seconds

Unmap #events: 1 time: 0.0000 seconds.

amd

$ ./test_objalloc

Size: 225000000 x 4 = 900000000 bytesmax sizet 18446744073709551615

Tahiti

WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds

CL_MAP_WRITE 0

WALL time for Map #0 =   0.00 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.95 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_WRITE 1

WALL time for Map #1 =   0.00 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.57 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_READ 0

WALL time for Map #0 =   0.00 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_READ 1

WALL time for Map #1 =   0.00 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 1 time: 0.0000 seconds.

Mapping with CL_MAP_WRITE and writing to mapped area

WALL time for Map #1 =   0.00 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for write mapped area =   0.54 seconds

WALL time for Unmap =   0.58 seconds

Unmap #events: 1 time: 0.0000 seconds.

Allocating memory using memalign 4096

WALL time for memalign =   0.00 seconds

WALL time for CL_MEM_USE_HOST_PTR =   0.07 seconds

CL_MAP_WRITE 0

WALL time for Map #0 =   0.00 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.71 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_WRITE 1

WALL time for Map #1 =   0.00 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.58 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_READ 0

WALL time for Map #0 =   0.00 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 1 time: 0.0000 seconds.

CL_MAP_READ 1

WALL time for Map #1 =   0.00 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for Unmap =   0.00 seconds

Unmap #events: 1 time: 0.0000 seconds.

Mapping with CL_MAP_WRITE and writing to mapped area

WALL time for Map #1 =   0.00 seconds

Map #events: 1 time: 0.0000 seconds.

WALL time for write mapped area =   0.54 seconds

WALL time for Unmap =   0.58 seconds

Unmap #events: 1 time: 0.0000 seconds.

0 Likes

Hi yurtesen,

I can't compile your code successfully. I need more time to check it.

0 Likes

It should compile successfully using the libraries/includes provided in the file (at least one other person here could compile it). Also, I am able to compile it easily on Fedora 16, Fedora 10, Ubuntu 11.x, Ubuntu 12.x and Scientific Linux 6 (SL6). Can you tell me what problem you are having?

0 Likes

test_objalloc.cpp: In function ‘void mapwrite()’:

test_objalloc.cpp:207:17: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

test_objalloc.cpp:205:24: warning: ‘i’ is used uninitialized in this function [-Wuninitialized]

/tmp/cc8O7n4J.o: In function `cl_checkelapsedtime(std::vector<cl::Event, std::allocator<cl::Event> >, char const*, int)':

test_objalloc.cpp:(.text+0xb6): undefined reference to `std::cout'

test_objalloc.cpp:(.text+0xbb): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)'

test_objalloc.cpp:(.text+0xca): undefined reference to `std::cout'

test_objalloc.cpp:(.text+0xcf): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)'

test_objalloc.cpp:(.text+0xd7): undefined reference to `std::cout'

test_objalloc.cpp:(.text+0xdf): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<unsigned long>(unsigned long)'

test_objalloc.cpp:(.text+0xf4): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)'

test_objalloc.cpp:(.text+0x10e): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)'

test_objalloc.cpp:(.text+0x123): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)'

test_objalloc.cpp:(.text+0x150): undefined reference to `std::basic_ostream<char, std::char_traits<char> >::put(char)'

test_objalloc.cpp:(.text+0x158): undefined reference to `std::basic_ostream<char, std::char_traits<char> >::flush()'

test_objalloc.cpp:(.text+0x174): undefined reference to `std::ctype<char>::_M_widen_init() const'

test_objalloc.cpp:(.text+0x1b3): undefined reference to `std::cout'

test_objalloc.cpp:(.text+0x1be): undefined reference to `std::cout'

test_objalloc.cpp:(.text+0x1c9): undefined reference to `std::basic_ios<char, std::char_traits<char> >::clear(std::_Ios_Iostate)'

test_objalloc.cpp:(.text+0x1d3): undefined reference to `std::__throw_bad_cast()'

/tmp/cc8O7n4J.o: In function `checkErr(int, char const*)':

test_objalloc.cpp:(.text+0x24b): undefined reference to `std::cerr'

test_objalloc.cpp:(.text+0x250): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

test_objalloc.cpp:(.text+0x25b): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

test_objalloc.cpp:(.text+0x268): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

test_objalloc.cpp:(.text+0x272): undefined reference to `std::basic_ostream<char, std::char_traits<char> >::operator<<(int)'

test_objalloc.cpp:(.text+0x27f): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

test_objalloc.cpp:(.text+0x28c): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

test_objalloc.cpp:(.text+0x297): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

test_objalloc.cpp:(.text+0x2a4): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'

test_objalloc.cpp:(.text+0x2ac): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::endl<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&)'

/tmp/cc8O7n4J.o: In function `map(int, unsigned long)':

test_objalloc.cpp:(.text+0x483): undefined reference to `operator new(unsigned long)'

test_objalloc.cpp:(.text+0x524): undefined reference to `operator delete(void*)'

test_objalloc.cpp:(.text+0x594): undefined reference to `operator delete(void*)'

test_objalloc.cpp:(.text+0x5de): undefined reference to `operator delete(void*)'

test_objalloc.cpp:(.text+0x600): undefined reference to `operator delete(void*)'

test_objalloc.cpp:(.text+0x61e): undefined reference to `std::__throw_bad_alloc()'

test_objalloc.cpp:(.text+0x635): undefined reference to `__cxa_begin_catch'

test_objalloc.cpp:(.text+0x650): undefined reference to `__cxa_rethrow'

test_objalloc.cpp:(.text+0x658): undefined reference to `__cxa_end_catch'

/tmp/cc8O7n4J.o: In function `unmap()':

test_objalloc.cpp:(.text+0x7f4): undefined reference to `operator new(unsigned long)'

test_objalloc.cpp:(.text+0x894): undefined reference to `operator delete(void*)'

test_objalloc.cpp:(.text+0x903): undefined reference to `operator delete(void*)'

test_objalloc.cpp:(.text+0x943): undefined reference to `__cxa_begin_catch'

test_objalloc.cpp:(.text+0x95e): undefined reference to `__cxa_rethrow'

test_objalloc.cpp:(.text+0x968): undefined reference to `std::__throw_bad_alloc()'

test_objalloc.cpp:(.text+0x970): undefined reference to `__cxa_end_catch'

test_objalloc.cpp:(.text+0x97f): undefined reference to `operator delete(void*)'

test_objalloc.cpp:(.text+0x9bc): undefined reference to `operator delete(void*)'

/tmp/cc8O7n4J.o: In function `std::vector<cl::Device, std::allocator<cl::Device> >::~vector()':

test_objalloc.cpp:(.text._ZNSt6vectorIN2cl6DeviceESaIS1_EED2Ev[_ZNSt6vectorIN2cl6DeviceESaIS1_EED5Ev]+0x5a): undefined reference to `operator delete(void*)'

test_objalloc.cpp:(.text._ZNSt6vectorIN2cl6DeviceESaIS1_EED2Ev[_ZNSt6vectorIN2cl6DeviceESaIS1_EED5Ev]+0x3f): undefined reference to `operator delete(void*)'

/tmp/cc8O7n4J.o: In function `std::vector<cl::Event, std::allocator<cl::Event> >::~vector()':

test_objalloc.cpp:(.text._ZNSt6vectorIN2cl5EventESaIS1_EED2Ev[_ZNSt6vectorIN2cl5EventESaIS1_EED5Ev]+0x5a): undefined reference to `operator delete(void*)'

/tmp/cc8O7n4J.o:test_objalloc.cpp:(.text._ZNSt6vectorIN2cl5EventESaIS1_EED2Ev[_ZNSt6vectorIN2cl5EventESaIS1_EED5Ev]+0x3f): more undefined references to `operator delete(void*)' follow

/tmp/cc8O7n4J.o: In function `std::vector<cl::Event, std::allocator<cl::Event> >::_M_insert_aux(__gnu_cxx::__normal_iterator<cl::Event*, std::vector<cl::Event, std::allocator<cl::Event> > >, cl::Event const&)':

test_objalloc.cpp:(.text._ZNSt6vectorIN2cl5EventESaIS1_EE13_M_insert_auxEN9__gnu_cxx17__normal_iteratorIPS1_S3_EERKS1_[std::vector<cl::Event, std::allocator<cl::Event> >::_M_insert_aux(__gnu_cxx::__normal_iterator<cl::Event*, std::vector<cl::Event, std::allocator<cl::Event> > >, cl::Event const&)]+0x17e): undefined reference to `operator new(unsigned long)'

test_objalloc.cpp:(.text._ZNSt6vectorIN2cl5EventESaIS1_EE13_M_insert_auxEN9__gnu_cxx17__normal_iteratorIPS1_S3_EERKS1_[std::vector<cl::Event, std::allocator<cl::Event> >::_M_insert_aux(__gnu_cxx::__normal_iterator<cl::Event*, std::vector<cl::Event, std::allocator<cl::Event> > >, cl::Event const&)]+0x244): undefined reference to `operator delete(void*)'

test_objalloc.cpp:(.text._ZNSt6vectorIN2cl5EventESaIS1_EE13_M_insert_auxEN9__gnu_cxx17__normal_iteratorIPS1_S3_EERKS1_[std::vector<cl::Event, std..

Any ideas?

0 Likes

Can you tell me the command line you are giving to the compiler? Are you sure that you are using a C++ compiler? Which compiler are you using, and which version? Did you extract all the files into the same directory?

Can you try:

g++ -O3 -g -I./include test_objalloc.cpp -L/opt/AMDAPP/lib -L./lib -lOpenCL -o test_objalloc

0 Likes

Sorry, I'm just so stupid! However, I got the result from HD6970, but I reduced the size.

Size: 22500000 x 4 = 90000000 bytesmax sizet 18446744073709551615
Cypress
WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds
CL_MAP_WRITE 0
WALL time for Map #0 =   0.04 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
CL_MAP_WRITE 1
WALL time for Map #1 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
CL_MAP_WRITE 2
WALL time for Map #2 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
CL_MAP_READ 0
WALL time for Map #0 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
CL_MAP_READ 1
WALL time for Map #1 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
CL_MAP_READ 2
WALL time for Map #2 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
Mapping with CL_MAP_WRITE and writing to mapped area
WALL time for Map #1 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for write mapped area =   0.03 seconds
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.

Allocating memory using memalign 4096
WALL time for memalign =   0.00 seconds
WALL time for CL_MEM_USE_HOST_PTR =   0.03 seconds
CL_MAP_WRITE 0
WALL time for Map #0 =   0.03 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
CL_MAP_WRITE 1
WALL time for Map #1 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
CL_MAP_WRITE 2
WALL time for Map #2 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
CL_MAP_READ 0
WALL time for Map #0 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
CL_MAP_READ 1
WALL time for Map #1 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
CL_MAP_READ 2
WALL time for Map #2 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.
Mapping with CL_MAP_WRITE and writing to mapped area
WALL time for Map #1 =   0.00 seconds
Map #events: 1 time: 0.0000 seconds.
WALL time for write mapped area =   0.03 seconds
WALL time for Unmap =   0.00 seconds
Unmap #events: 1 time: 0.0000 seconds.

0 Likes

Wenju, those results are wrong. Try the first program (in the first post) with 3 objects. I think there is a second problem that prevents allocating a large amount of memory in a single buffer. (It appears to do nothing when the size is too large.)

0 Likes

Max memory allocation: 536870912, and I used the max size. So nothing goes wrong for you when using 15000*15000? It failed for me.

This time I tested it on the 7970.
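To make the size mismatch explicit: a 15000*15000 array of 4-byte elements needs 900000000 bytes, which is more than the 536870912-byte limit quoted above. The sketch below does that check; it assumes the quoted limit is the device's CL_DEVICE_MAX_MEM_ALLOC_SIZE, and the helper names are illustrative rather than taken from the test program.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for an n x n array of elements of the given size.
std::uint64_t matrix_bytes(std::uint64_t n, std::uint64_t elem_size) {
    return n * n * elem_size;
}

// True if a single buffer of `bytes` fits under the per-allocation limit
// (assumed here to be the value reported as CL_DEVICE_MAX_MEM_ALLOC_SIZE).
bool fits_single_buffer(std::uint64_t bytes, std::uint64_t max_alloc) {
    return bytes <= max_alloc;
}

// Largest element count that still fits in one buffer.
std::uint64_t max_elements(std::uint64_t max_alloc, std::uint64_t elem_size) {
    assert(elem_size > 0);
    return max_alloc / elem_size;
}
```

Here matrix_bytes(15000, 4) is 900000000, fits_single_buffer(900000000, 536870912) is false, and max_elements(536870912, 4) is 134217728, which is the element count used in the run below.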

Size: 134217728 x 4 = 536870912 bytes * 3 = 1610612736
Tahiti
WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds
CL_MAP_WRITE 0
WALL time for Map #0 =   0.05 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap =   1.17 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_WRITE 1
WALL time for Map #1 =   0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap =   0.70 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 0
WALL time for Map #0 =   0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 1
WALL time for Map #1 =   0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 3 time: 0.0000 seconds.
Mapping with CL_MAP_WRITE and writing to mapped area
WALL time for Map #1 =   0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for write mapped area =   0.54 seconds
WALL time for Unmap =   0.77 seconds
Unmap #events: 3 time: 0.0000 seconds.

Allocating memory using memalign 4096
WALL time for memalign =   0.00 seconds
WALL time for CL_MEM_USE_HOST_PTR =   0.09 seconds
CL_MAP_WRITE 0
WALL time for Map #0 =   0.01 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap =   0.97 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_WRITE 1
WALL time for Map #1 =   0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap =   0.69 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 0
WALL time for Map #0 =   0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 1
WALL time for Map #1 =   0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap =   0.00 seconds
Unmap #events: 3 time: 0.0000 seconds.
Mapping with CL_MAP_WRITE and writing to mapped area
WALL time for Map #1 =   0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for write mapped area =   0.54 seconds
WALL time for Unmap =   0.77 seconds
Unmap #events: 3 time: 0.0000 seconds.

0 Likes

I am not sure whether AMD's implementation uses some trick and skips the device-to-host copy because no kernel was run on the device in my test case (which would be smart); maybe that is why it does not transfer the data in?

Yes. If the memory object is created with CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR, the pointer that clEnqueueMapBuffer/clEnqueueMapImage returns will always be the same. Before transferring data, the runtime checks whether the pointer is a new one. If it is a new one, it will not transfer the data.

0 Likes

If you map for reading, the implementation copies the data from device to host. I am not sure exactly what is wrong with the 'single object' version that I made. The attached program should work on Cypress, and my results on Cypress look like this (about 1.2 GB/s):

Size: 60000000 x 4 = 240000000 bytes * 3 = 720000000

Cypress

WALL time for CL_MEM_ALLOC_HOST_PTR =   0.00 seconds

CL_MAP_WRITE 0

WALL time for Map #0 =   0.14 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.48 seconds

Unmap #events: 3 time: 0.0000 seconds.

CL_MAP_WRITE 1

WALL time for Map #1 =   0.00 seconds

Map #events: 3 time: 0.0000 seconds.

WALL time for Unmap =   0.35 seconds

Unmap #events: 3 time: 0.0000 seconds.

...

...

Anyway, the point was why the transfers are so slow...

0 Likes