
yurtesen
Miniboss

CL_MEM_USE_HOST_PTR slower than not using it...

According to the AMD OpenCL Programming Guide, CL_MEM_USE_HOST_PTR should cause pre-pinned memory to be used, which is supposed to be efficient. I am testing the following on Tahiti (on a motherboard with PCIe 2.x), but I am getting strange results.

I have 2 buffers created with CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR and 1 buffer created with CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, each holding a 15000x15000 dense float matrix, totaling ~1.8 GB. I am running clAmdBlasSgemm on these, then loading the result back to host memory. I am using enqueue write/read buffer commands, with blocking, and I also tried non-blocking.

The result with events is: writing matrix1 0.1638 seconds, matrix2 0.1631 seconds, clAmdBlasSgemm 4.137 seconds, reading 0.1448 seconds; according to my host code, it takes about 5.8 seconds (wall time) to accomplish all of this. If I change blocking to non-blocking but put a clFinish() after each operation, I get writes of 1.735 and 1.598 seconds, sgemm 4.138 seconds, and a read of 0.1452 seconds. In either case these do not add up to the 5.8 seconds of wall time.

If I get rid of CL_MEM_USE_HOST_PTR on the objects, the results are: writing (total) 0.3458 seconds, sgemm 4.137 seconds, and reading 0.2411 seconds, with a wall time of 4.9 seconds. I tried non-blocking read/write with clFinish() as well and got exactly the same results from the events.

So, without CL_MEM_USE_HOST_PTR, things go much quicker? Is there a mistake in the manual?
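
Roughly, the two setups I am comparing look like this (a simplified sketch with placeholder names, not my actual code; the clAmdBlasSgemm call and the result read-back are omitted):

    #include <CL/cl.hpp>

    void make_inputs(cl::Context &ctx, cl::CommandQueue &queue,
                     float *hostA, float *hostB, size_t n)
    {
        cl_int err;
        size_t bytes = sizeof(float) * n;

        // Variant 1: the runtime wraps (and may pin) the caller's existing allocation.
        cl::Buffer A_use(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, bytes, hostA, &err);
        cl::Buffer B_use(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, bytes, hostB, &err);

        // Variant 2: plain device buffers, filled with explicit blocking writes.
        cl::Buffer A_dev(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
        cl::Buffer B_dev(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
        queue.enqueueWriteBuffer(A_dev, CL_TRUE, 0, bytes, hostA);
        queue.enqueueWriteBuffer(B_dev, CL_TRUE, 0, bytes, hostB);
        // ... clAmdBlasSgemm(...) and enqueueReadBuffer() of the result would follow ...
    }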

I also tried the  SDK BufferBandwidth example and this does not use HOST_PTR either...

$ ./BufferBandwidth  -t 4

Device  0            Tahiti
Build:               DEBUG
GPU work items:      32768
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        1
inputBuffer:         CL_MEM_READ_ONLY
outputBuffer:        CL_MEM_WRITE_ONLY

AVERAGES (over loops 2 - 19, use -l for complete log)
--------
          PCIe B/W device->host:  0.005904 s       5.68 GB/s
          PCIe B/W host->device:  0.005211 s       6.44 GB/s

Passed!

Thanks,

Evren

0 Likes
27 Replies
sh2
Adept II

Without CL_MEM_USE_HOST_PTR, the buffer is allocated in device local memory (video memory), and local memory is much faster than the PCIe bus (~200 GB/s vs ~16 GB/s).

0 Likes

After thinking about it for a while, that makes a little bit of sense. But does it really allocate space in host memory? It just uses the already allocated space, doesn't it?

Say I have a pointer A where my data lives, and I make a mem object CL_A using host pointer A. If it uses the already allocated space, does that mean that if I do an enqueueWrite A -> CL_A, it writes the data over itself? What does it actually do?

I removed my enqueueWrite statements just for fun and the program took even longer to complete, yet the results looked OK. So it is a little bit confusing; I thought it would get faster since I wouldn't be unnecessarily copying data over itself!?

There are just too many types of memory allocation in OpenCL!

0 Likes

And of course, if the buffer is in GPU device memory when HOST_PTR is not used, then what is the purpose of CL_MEM_USE_PERSISTENT_MEM_AMD?

0 Likes

Memory allocated with the CL_MEM_USE_PERSISTENT_MEM_AMD flag is visible to the CPU.

0 Likes

So are you saying that memory allocated with CL_MEM_USE_HOST_PTR is not visible to the CPU?

0 Likes

sh, I checked the AMD OpenCL Guide, page 4-17. There is a table 4.2 named "OpenCL Memory Object Properties", and it says the location of the buffer is Device Memory when CL_MEM_USE_HOST_PTR is used.

In addition, the table shows that the default (no flags) is pinned host memory if the data is smaller than 32 MB, and host memory only (I understand, not pinned) if it is larger than 32 MB.

Don't you think this means CL_MEM_USE_HOST_PTR should give improved performance? It is the same as the default allocation, plus it pins the memory location for faster transfers.

0 Likes

Most likely you have PCIe 2.0, so ~6 GB/s is the maximum bandwidth. This code can't be optimized further.

0 Likes

The question  was not about the maximum performance achieved. The question is why CL_MEM_USE_HOST_PTR causes slower execution.

0 Likes

USE_HOST_PTR just means that if you use map/unmap, the driver will always have to copy the data to/from the exact same heap memory pointer you passed to it, i.e. there is no possibility of zero-copy, and there may be other detrimental effects due to misalignment. Apart from conveniently loading constants or interacting with pre-existing systems and their memory directly, USE_HOST_PTR doesn't seem to me to be a good choice for general buffer allocation.

ALLOC_HOST_PTR will let the driver do the malloc, and map/unmap is mostly just that (although the data still needs to get to/from the device for discrete parts). And everything will be ideally aligned. Potentially the memory is zero-copy and/or directly pinned, as the driver can create new virtual memory spaces compatible with both the device and the host.

MEM_USE_PERSISTENT_MEM_AMD will put the buffer on the device and let the host read/write it directly: possibly slower for whole updates of large contiguous arrays, but definitely better for sparse CPU-side changes. I don't know much about this though; I'm just going from the guide, e.g. table 4.2.

See table 4.3 of the AMD programming guide and the following couple of paragraphs, and actually pretty much all of section 4.4 in all of its gritty - but fairly readable - detail. E.g. following table 4.2, section 4.4.1 Host Memory and the next few sections go into detail.

And things like pinned, zero-copy, and device-visible memory are limited and constrained by hardware and operating systems, so they can't always be used even if everything else is compatible.
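
For the persistent-memory case, something like this rough, untested sketch (the flag comes from AMD's CL/cl_ext.h; names and sizes are just placeholders):

    #include <CL/cl.h>
    #include <CL/cl_ext.h>
    #include <CL/cl.hpp>

    void sparse_host_update(cl::Context &context, cl::CommandQueue &queue, size_t n)
    {
        cl_int err;
        // Buffer lives in device memory but is directly host-accessible.
        cl::Buffer buf(context, CL_MEM_READ_WRITE | CL_MEM_USE_PERSISTENT_MEM_AMD,
                       sizeof(float) * n, NULL, &err);
        float *p = (float *) queue.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_WRITE, 0,
                                                    sizeof(float) * n, NULL, NULL, &err);
        // Sparse CPU-side writes go over the interconnect element by element;
        // whole-array updates would likely be slower than a batched copy.
        p[0] = 1.0f;
        p[n - 1] = 2.0f;
        queue.enqueueUnmapMemObject(buf, p);
        queue.finish();
    }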

notzed, thanks for the detailed answer. From the manual, I understand that CL_MEM_ALLOC_HOST_PTR allocates the memory in host memory and uses zero-copy (at least on Tahiti), so map/unmap should be a no-op. Is this not correct?

Please have a look at the code below (which does nothing but allocate mem objects and map/unmap them) and its output:

    // Test

    // For calculating time difference
    #define TVDIFF(tv1, tv2) ((tv2).tv_sec - (tv1).tv_sec \
        + ((tv2).tv_usec - (tv1).tv_usec) * 1E-6)

    #include <iostream>
    #include <fstream>
    #include <vector>
    #include <cstdio>
    #include <CL/cl.h>
    #include <CL/cl_ext.h>
    #include <CL/cl.hpp>
    #include <sys/time.h>

    using namespace std;

    int main() {
        struct timeval tv1, tv2;
        int size = 11000*11000;
        cl_int err;

        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0 };
        cl::Context context(CL_DEVICE_TYPE_GPU, properties);
        std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
        for (int i = 0; i < devices.size(); i++) cout << devices[i].getInfo<CL_DEVICE_NAME>() << endl;

        cl_command_queue_properties queue_prop = 0;
        cl::CommandQueue clqueue(context, devices[0], queue_prop, &err);

        float *p_X, *p_Y, *p_Z;
        cl::Buffer cl_X = cl::Buffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(float) * size, NULL, &err);
        cl::Buffer cl_Y = cl::Buffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(float) * size, NULL, &err);
        cl::Buffer cl_Z = cl::Buffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(float) * size, NULL, &err);

        p_X = (float *) clqueue.enqueueMapBuffer(cl_X, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * size, NULL, NULL, &err);
        p_Y = (float *) clqueue.enqueueMapBuffer(cl_Y, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * size, NULL, NULL, &err);
        p_Z = (float *) clqueue.enqueueMapBuffer(cl_Z, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * size, NULL, NULL, &err);

        for (int i = 0; i < size; i++) p_X[i] = 1.0f;
        for (int i = 0; i < size; i++) p_Y[i] = 2.0f;
        for (int i = 0; i < size; i++) p_Z[i] = 3.0f;

        gettimeofday(&tv1, NULL);     /* Start measuring time */
        clqueue.enqueueUnmapMemObject(cl_X, p_X, NULL, NULL);
        clqueue.enqueueUnmapMemObject(cl_Y, p_Y, NULL, NULL);
        clqueue.enqueueUnmapMemObject(cl_Z, p_Z, NULL, NULL);
        clqueue.finish();
        gettimeofday(&tv2, NULL);     /* Stop measuring time */

        printf("\nWALL time for unmap = %6.1f seconds\n\n", TVDIFF(tv1,tv2) );
        return 0;
    }

Output:

Tahiti

WALL time for unmap =    1.3 seconds

Can you tell me why this supposed no-op takes 1.3 seconds to complete? Is there a mistake in the code?

0 Likes


What the deuce... email reply still doesn't work.

I sent this:

On 30/05/12 06:43, yurtesen wrote:



notzed, thanks for the detailed answer. From the manual, I understand that CL_MEM_ALLOC_HOST_PTR allocates the memory in host memory and uses zero-copy (at least on Tahiti), so map/unmap should be a no-op. Is this not correct?

I think you missed the part where the manual says that zero-copy is OS/hardware constrained. I suspect your ~500 MB allocation is exceeding those limits... you could try looping through a few sizes and seeing whether it scales linearly or suddenly jumps at a specific size (see the sketch at the end of this reply).

e.g. section 4.5.1.2 "Pinned Memory"

"The runtime limits the total amount of pinned host memory that can be used for memory objects".

I noticed the AMD programming guide was updated recently so I had a look at the 2.1a version and it has some more info.

Section 4.6.2.2 has some stuff about zero-copy, and although it doesn't mention limits for ALLOC_HOST_PTR, for USE_PERSISTENT_MEM_AMD it mentions a 64 MB per-buffer limit.

Remember that a discrete part will always need to copy the data at some point, and if you're doing reads and writes of the whole buffer each time, just doing a batched copy might be more efficient. These zero-copy buffers have other performance characteristics too; e.g. any memory that ends up uncached on the CPU can be a massive performance hit if you need random reads or writes from the CPU.
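
The size sweep could look something like this (a quick untested sketch, all names mine):

    #include <iostream>
    #include <vector>
    #include <sys/time.h>
    #include <CL/cl.h>
    #include <CL/cl.hpp>

    #define TVDIFF(tv1, tv2) ((tv2).tv_sec - (tv1).tv_sec \
        + ((tv2).tv_usec - (tv1).tv_usec) * 1E-6)

    int main() {
        cl_int err;
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        cl_context_properties props[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0 };
        cl::Context context(CL_DEVICE_TYPE_GPU, props);
        std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
        cl::CommandQueue queue(context, devices[0], 0, &err);

        // Double the size from 16 MB up to 512 MB and watch for a sudden jump in unmap time.
        for (size_t bytes = 16UL << 20; bytes <= 512UL << 20; bytes *= 2) {
            cl::Buffer buf(context, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
            float *p = (float *) queue.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_WRITE, 0, bytes, NULL, NULL, &err);
            for (size_t i = 0; i < bytes / sizeof(float); i++) p[i] = 1.0f;

            struct timeval tv1, tv2;
            gettimeofday(&tv1, NULL);
            queue.enqueueUnmapMemObject(buf, p);
            queue.finish();
            gettimeofday(&tv2, NULL);
            std::cout << (bytes >> 20) << " MB unmap: " << TVDIFF(tv1, tv2) << " s" << std::endl;
        }
        return 0;
    }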

Z

0 Likes

I am not convinced, because it doesn't make sense. Where is it going to copy the data to? The area is in host memory (it uses CL_MEM_ALLOC_HOST_PTR), and I am mapping it, which should simply give me the host pointer, so whatever I write there goes directly to where it should be. Therefore unmapping shouldn't need to copy anything.

If you check 4.6.2.2 carefully, you can see that the 64 MB limit is under the CL_MEM_USE_PERSISTENT_MEM_AMD section only, and indeed I am not able to allocate huge memory areas using that option.

Section 4.5.1.2 might be referring to automatic pinning of data under 32 MB, but it is not very clear. Also, whether the memory is pinned or not shouldn't have an effect on the zero-copy feature (there is no mention of that?).

The example was purely to prove the point that zero-copy does not seem to be functioning properly. Of course, in a real application I will use whatever works best for the application's access pattern.

0 Likes

yurtesen wrote:

I am not convinced, because it doesn't make sense. Where is it going to copy the data to? The area is in host memory (it uses CL_MEM_ALLOC_HOST_PTR), and I am mapping it, which should simply give me the host pointer, so whatever I write there goes directly to where it should be. Therefore unmapping shouldn't need to copy anything.

If you check 4.6.2.2 carefully, you can see that the 64 MB limit is under the CL_MEM_USE_PERSISTENT_MEM_AMD section only, and indeed I am not able to allocate huge memory areas using that option.

Yeah, that's what I said.

Section 4.5.1.2 might be referring to automatic pinning of data under 32 MB, but it is not very clear. Also, whether the memory is pinned or not shouldn't have an effect on the zero-copy feature (there is no mention of that?).

The example was purely to prove the point that zero-copy does not seem to be functioning properly. Of course, in a real application I will use whatever works best for the application's access pattern.

Clearly you're not getting zero copy behaviour.  The guide just says it 'may' use it, not that it 'will always' use it.  I've given a few reasons why it wouldn't be able to.

0 Likes

I can see that it does not do zero-copy, and I am thinking this might be a bug? Unless somebody comes up with code that can actually do zero-copy as advertised in the documentation?

0 Likes

I have now tested this on Windows, and the same program completes the unmap in 0.0 seconds of wall time.

So it is a bug in Linux, do you agree? (A bug, since the documentation says zero-copy is supported on Linux as well.)

0 Likes

Could you  lock this amount of pinned memory manually?

http://linux.die.net/man/2/mlock
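
For example, roughly (an untested sketch; the size matches the ~484 MB from your test, and it needs a sufficient RLIMIT_MEMLOCK / ulimit -l):

    #include <sys/mman.h>
    #include <cstdlib>
    #include <cstdio>

    int main() {
        size_t bytes = 11000UL * 11000UL * sizeof(float);   // ~484 MB, as in the test above
        void *p = NULL;
        if (posix_memalign(&p, 4096, bytes) != 0) { perror("posix_memalign"); return 1; }
        // Pin the pages in physical memory.
        if (mlock(p, bytes) != 0) { perror("mlock"); return 1; }
        printf("locked %zu bytes\n", bytes);
        munlock(p, bytes);
        free(p);
        return 0;
    }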

0 Likes

I am not sure why a no-op operation would care about such things?

0 Likes

Because zero copy can't be done with ordinary (unpinned) memory.

0 Likes

Why should it be pinned? That does not make much sense to me... (and it is not mentioned in the manual).

Anyway, I just went ahead and set ulimit -l to 8 GB, and the result is still exactly the same (1.3 seconds):

max locked memory   (kbytes, -l) 8388608
0 Likes

I think CL_MEM_ALLOC_HOST_PTR might as well be copying the data to device memory. It appears that when I use this option, I am no longer able to allocate as much device memory as I could before.

0 Likes

Hi Yurtesen~

When using CL_MEM_USE_HOST_PTR, make sure to align your memory buffer on a page boundary, typically 4K.  You may be underestimating the performance penalty of a misaligned buffer. Take a look at section 4.5.4.2 in the OpenCL Programming guide.

When you create a buffer with CL_MEM_USE_HOST_PTR, the runtime is still free to create cached copies in device memory.  Your data always exists in host memory, but for kernel execution speed the device may cache a copy of your buffer.  Take a look at section 4.6.2.3.  To make sure that you have the most recent copy of data after executing a kernel, you should first call MapBuffer to synchronize host memory with device memory, and then read host memory.
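
For illustration, something along these lines (a simplified sketch, not production code; the helper names are just examples):

    #include <CL/cl.hpp>
    #include <cstdlib>

    // Page-aligned host allocation handed to CL_MEM_USE_HOST_PTR (section 4.5.4.2).
    cl::Buffer make_aligned_use_host_ptr(cl::Context &ctx, size_t bytes, void **host_out)
    {
        cl_int err;
        void *host = NULL;
        posix_memalign(&host, 4096, bytes);   // 4 KB page alignment
        *host_out = host;
        return cl::Buffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, bytes, host, &err);
    }

    // After a kernel has written the buffer, map it for reading to synchronize any
    // cached device copy back into the original host allocation, then read host memory.
    void sync_to_host(cl::CommandQueue &queue, cl::Buffer &buf, size_t bytes)
    {
        cl_int err;
        void *p = queue.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_READ, 0, bytes, NULL, NULL, &err);
        queue.enqueueUnmapMemObject(buf, p);
        queue.finish();
    }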

Hope this helps,

Kent

0 Likes

Hello Kent,

I figured that CL_MEM_USE_HOST_PTR data is copied to the device. However, if CL_MEM_USE_HOST_PTR is not used, the data is copied to the device anyway (what I mean is, the host_ptr option should speed things up, not slow them down). According to the documentation, using CL_MEM_USE_HOST_PTR should pin the memory, which should improve performance, but it appears to take longer. (I should run more tests on this to give you a more definitive explanation, but see below.)

Anyway, for creating buffers that reside in the host, I made a test program (see above) which uses CL_MEM_ALLOC_HOST_PTR. I can understand that this can be cached in device memory, why not... but it takes 1.3 seconds to map/unmap roughly 500 MB of data! Something is clearly wrong here (even if the whole data were copied, it should take much less time).

Moreover, I made a test program where I copy the data (using enqueueCopy) from an object allocated with CL_MEM_ALLOC_HOST_PTR to an object allocated without it (an object in device memory); a sketch of this test is at the end of this post. The speed was roughly 90 GB/sec. So I figure the data, which is not supposed to be copied, ends up in device memory (or cached there) at some point, and that operation is somehow very time consuming. I would expect caching to take no time at all.

Moreover, if I allocate a buffer with CL_MEM_ALLOC_HOST_PTR, the amount I can allocate on the device without CL_MEM_ALLOC_HOST_PTR decreases. If the data allocated with CL_MEM_ALLOC_HOST_PTR is in host memory, why shouldn't I be able to allocate the same amount on the device anymore?

Anyway, this is the situation. Can you explain why the map/unmap in the code in my previous post takes 1.3 seconds to complete?
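
The copy test was roughly like this (a simplified sketch, not the exact code I ran; names are placeholders):

    #include <CL/cl.hpp>
    #include <cstdio>
    #include <sys/time.h>

    #define TVDIFF(tv1, tv2) ((tv2).tv_sec - (tv1).tv_sec \
        + ((tv2).tv_usec - (tv1).tv_usec) * 1E-6)

    // Copy from a CL_MEM_ALLOC_HOST_PTR buffer into a plain device buffer and
    // derive the effective bandwidth from the wall time.
    void time_copy(cl::Context &ctx, cl::CommandQueue &queue, size_t bytes)
    {
        cl_int err;
        cl::Buffer src(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
        cl::Buffer dst(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

        struct timeval tv1, tv2;
        gettimeofday(&tv1, NULL);
        queue.enqueueCopyBuffer(src, dst, 0, 0, bytes);
        queue.finish();
        gettimeofday(&tv2, NULL);
        printf("copy: %.3f GB/s\n", bytes / TVDIFF(tv1, tv2) / 1e9);
    }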

0 Likes

Hi yurtesen,

1. I would recommend using a clFinish() call before you start measuring the data-transfer time; otherwise a lot of other set-up time might be included in your 1.3 s. Or use events on the enqueueWriteBuffer calls to get precise timing (see the sketch after these points).

2. I don't think you are using the recommended path from the programming guide (unless it has changed recently). You appear to need to write into OpenCL buffers, whereas CL_MEM_ALLOC_HOST_PTR is the recommended flag when you need to read data from an OpenCL buffer (because of the write-combine (WC) behaviour of such buffers).

3. In the situation above, I would recommend CL_MEM_USE_PERSISTENT_MEM_AMD, which allocates the buffer in host-accessible device memory. Writing to such a buffer should happen at peak interconnect speed.
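
For point 1, something like this (a rough sketch with placeholder names; note that CL_PROFILING_* queries are only valid when the queue is created with CL_QUEUE_PROFILING_ENABLE):

    #include <CL/cl.hpp>

    // Time a blocking write using event profiling and return the duration in seconds.
    double time_write_seconds(cl::Context &context, cl::Device &device,
                              cl::Buffer &buf, const float *src, size_t bytes)
    {
        cl_int err;
        cl::CommandQueue queue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
        cl::Event ev;
        queue.enqueueWriteBuffer(buf, CL_TRUE, 0, bytes, src, NULL, &ev);
        queue.finish();
        cl_ulong start = ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
        cl_ulong end   = ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();
        return (end - start) * 1e-9;   // nanoseconds -> seconds
    }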

0 Likes

gautam.himanshu wrote:

1. I would recommend using a clFinish() call before you start measuring the data-transfer time; otherwise a lot of other set-up time might be included in your 1.3 s. Or use events on the enqueueWriteBuffer calls to get precise timing.

In this case it is unnecessary, since the map operations are called with the blocking flag set to true and there are no other OpenCL calls between the maps and the point where the timing starts. (Note that unnecessary clFinish() calls will also make your code slower.)

In any case, I put a clFinish() before the first gettimeofday call and the results are the same.

I have found that event times are unreliable and should not be used for measuring program performance; they seem to tell you only what you want to see. The actual runtime of the program code is different. I made a more elaborate example to show it; the second example below prints out all the timings separately. It appears the event timer does not include some of the time!

gautam.himanshu wrote:

2. I don't think you are using the recommended path from the programming guide (unless it has changed recently). You appear to need to write into OpenCL buffers, whereas CL_MEM_ALLOC_HOST_PTR is the recommended flag when you need to read data from an OpenCL buffer (because of the write-combine (WC) behaviour of such buffers).

I am not writing to the buffer from an OpenCL kernel, I am writing from host code. The best performance is achieved when the host code writes to a pinned memory location on the host. I know the part of the manual you mention, but it is not talking about reading/writing to the area from host code. This is the recommended usage for the example code.

gautam.himanshu wrote:

3. In the situation above, I would recommend CL_MEM_USE_PERSISTENT_MEM_AMD, which allocates the buffer in host-accessible device memory. Writing to such a buffer should happen at peak interconnect speed.

That would place the memory in device memory. The peak interconnect speed in this case is the PCIe bandwidth, which is much less than with CL_MEM_*_HOST_PTR, where the CPU can write to the location at full memory bandwidth.

See:

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programmin...

Page 4-14, Table 4.1.

You can see that the CPU accessing device memory is extremely slow. In the code I used, only the CPU accesses the memory location. Do you see the problem? (And that table is based on PCIe 3.0 systems.)

0 Likes

$ g++ test.cpp -I./include -lOpenCL
$ time ./a.out
Tahiti
WALL time for map =    0.0 seconds
map #events: 3 time: 2e-09 seconds.
WALL time for for loops =    1.5 seconds
WALL time for unmap =    1.4 seconds
Unmap #events: 3 time: 2e-09 seconds.

real    0m3.765s
user    0m1.831s
sys     0m1.220s
$

As you can see, it appears the OpenCL runtime is copying the memory to a different location at unmap (judging from how long the operation takes). However, the manual clearly says this operation should be zero-copy/no-op, so there is a problem. If you think there is something wrong in the code, please quote the programming guide page number and section title (so I can go and read it). In addition, unmap takes 0 seconds on a Windows system using the same code, which means the zero-copy/no-op path works on Windows only (although the programming guide clearly says it works on Linux as well!).

Also, you can see that the events report 2e-09 seconds even though the operation took 1.4 seconds. Therefore the event timers are unreliable for measuring actual performance; they don't seem to include all of the operations performed.

Updated code is below:

#include <iostream>
#include <fstream>
#include <iomanip>
#include <vector>
#include <cstdio>
#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <CL/cl.hpp>
#include <sys/time.h>

using namespace std;

// For calculating time difference
#define TVDIFF(tv1, tv2) ((tv2).tv_sec - (tv1).tv_sec \
    + ((tv2).tv_usec - (tv1).tv_usec) * 1E-6)

void cl_checkelapsedtime(std::vector<cl::Event> events, const char *msg, int precision) {
  cl_ulong startTime, endTime;
  // Elapsed time from the first event's start to the last event's end
  cl::Event::waitForEvents(events);
  startTime = events[0].getProfilingInfo<CL_PROFILING_COMMAND_START>();
  endTime = events[events.size()-1].getProfilingInfo<CL_PROFILING_COMMAND_END>();
  cl_ulong kernelExecTimeNs = endTime - startTime;
  std::cout << msg << " #events: " << events.size()
            << " time: " << std::setprecision(precision) << (double)kernelExecTimeNs/1000000000 << " seconds."
            << std::endl;
}

int main() {
  struct timeval tv1, tv2;
  int size = 11000*11000;
  int device_id = 1;
  cl_int err;

  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0 };
  cl::Context context(CL_DEVICE_TYPE_GPU, properties);

  std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
  cout << devices[device_id].getInfo<CL_DEVICE_NAME>() << endl;

  cl_command_queue_properties queue_prop = 0;
  cl::CommandQueue clqueue(context, devices[device_id], queue_prop, &err);

  float *p_X, *p_Y, *p_Z;
  cl::Buffer cl_X = cl::Buffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(float) * size, NULL, &err);
  cl::Buffer cl_Y = cl::Buffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(float) * size, NULL, &err);
  cl::Buffer cl_Z = cl::Buffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(float) * size, NULL, &err);

  cl::Event event;
  std::vector<cl::Event> clevents;
  clqueue.finish();

  gettimeofday(&tv1, NULL);     /* Start measuring time */
  p_X = (float *) clqueue.enqueueMapBuffer(cl_X, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * size, NULL, &event, &err);
  clevents.push_back(event);
  p_Y = (float *) clqueue.enqueueMapBuffer(cl_Y, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * size, NULL, &event, &err);
  clevents.push_back(event);
  p_Z = (float *) clqueue.enqueueMapBuffer(cl_Z, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * size, NULL, &event, &err);
  clevents.push_back(event);
  clqueue.finish();
  gettimeofday(&tv2, NULL);     /* Stop measuring time */
  printf("WALL time for map = %6.1f seconds\n", TVDIFF(tv1,tv2) );
  cl_checkelapsedtime(clevents, "map", 4);
  clevents.clear();

  gettimeofday(&tv1, NULL);     /* Start measuring time */
  for (int i = 0; i < size; i++) p_X[i] = 1.0f;
  for (int i = 0; i < size; i++) p_Y[i] = 2.0f;
  for (int i = 0; i < size; i++) p_Z[i] = 3.0f;
  gettimeofday(&tv2, NULL);     /* Stop measuring time */
  printf("WALL time for for loops = %6.1f seconds\n", TVDIFF(tv1,tv2) );

  gettimeofday(&tv1, NULL);     /* Start measuring time */
  clqueue.enqueueUnmapMemObject(cl_X, p_X, NULL, &event);
  clevents.push_back(event);
  clqueue.enqueueUnmapMemObject(cl_Y, p_Y, NULL, &event);
  clevents.push_back(event);
  clqueue.enqueueUnmapMemObject(cl_Z, p_Z, NULL, &event);
  clevents.push_back(event);
  clqueue.finish();
  gettimeofday(&tv2, NULL);     /* Stop measuring time */
  printf("WALL time for unmap = %6.1f seconds\n", TVDIFF(tv1,tv2) );
  cl_checkelapsedtime(clevents, "Unmap", 4);
  clevents.clear();

  return 0;
}

0 Likes

It appears there is an undocumented size limit on zero-copy buffers on Linux when using CL_MEM_ALLOC_HOST_PTR, and if the limit is exceeded, the SDK silently falls back to a much slower path... (confirmed by AMD)

0 Likes