
nibal
Challenger

Bug in clEnqueueMapBuffer in SDK 2.9-1

As discussed in another thread, "Optimization Guide Memory Allocation", the Optimization Guide states that when the display driver (fglrx) supports VM, transferring data from the application to the GPU kernel should be zero-copy, provided the appropriate flags are passed to clCreateBuffer and clEnqueueMapBuffer is used for the transfer. I assume this works in other SDKs, since it is documented in the guide.

In my case:

clinfo | grep Driver

Driver version: 1445.5 (VM)

I'm using CL_MEM_ALLOC_HOST_PTR in clCreateBuffer and clEnqueueMapBuffer for the data transfer. CodeXL reports, for exactly the same amount of data:

A) Read/Write Buffers

WriteBuffer: 173 ms for 6241 calls, each @ 0.02773 ms

ReadBuffer: 122 ms for 390 calls, each @ 0.312 ms

B) Map/Unmap Buffers

MapBuffer: 193 ms for 6630 calls, each @ 0.02907 ms

UnmapBuffer: 120 ms for 6630 calls, each @ 0.01811 ms

Notice that the total time for the Read/WriteBuffer calls is actually slightly less than the total for the Map/UnmapBuffer calls, a far cry from the zero-copy behaviour it should show.
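
For reference, the pattern I'm describing is roughly the sketch below (illustrative only, not my actual code; the helper name, buffer size and missing error handling are placeholders):

#include <CL/cl.h>
#include <string.h>

/* Minimal sketch of the zero-copy write path described above (illustrative
   only; the function name and the missing error handling are placeholders). */
static cl_mem upload_zero_copy(cl_context ctx, cl_command_queue queue,
                               const void *hostData, size_t bufSize)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                bufSize, NULL, &err);

    /* With a VM-enabled driver the map should return a pointer into the
       pre-pinned allocation, i.e. no extra copy. */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, bufSize, 0, NULL, NULL, &err);
    memcpy(p, hostData, bufSize);   /* fill the input in place */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;
}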

Please fix.


dipak
Big Boss

There is an APP SDK sample named BufferBandwidth which measures the bandwidth characteristics achievable in OpenCL on a given system, including GPU memory and interconnect (e.g. PCIe) bandwidth. Could you please run that sample and check whether your observations match its output?

Another point: both APP SDK 2.9-1 and driver 1445.5 are quite old, so I would suggest upgrading both and checking the results with the latest versions. New fixes are only available in newer drivers and SDKs, so please use the latest ones.

Regards,


No problem. I realize you are a single person trying to tackle all these issues, so thank you for whatever you can do.

I attach the text file bw with the tests you asked for: map/unmap, read/write, and prepinned read/write for both DMA and PCIe. I would say my results are in general agreement with the sample's.

SDK 2.9-1 is not that old; it is just the one prior to SDK 3.0. Is SDK 3.0 stable and production-ready? I still read about issues in the forum.

If not, AMD should support at least one production release.

I thought I had the latest display driver; apparently I missed the last one. I will download it and retest.

Hmmm. I installed the latest Catalyst, fglrx-15.201.1151, and guess what:

clinfo | grep Driver

Driver version:                 1445.5 (VM)

Exactly the same as before. No sense in retesting. Did you have another driver in mind?


Hi,

As far as I know, SDK 2.9-1 is more than one year old. After SDK 2.9-1, SDK 3.0 Beta was released at the end of last year. Then a more stable version, SDK 3.0 GA, was released last month. It is the latest version and much more stable than the beta. You can download it from here: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/

Regarding the Catalyst, I was referring to this Catalyst version: Desktop

Though the above observation may not be directly related to the driver version, I'm a little confused by your driver version number. For example, please check the clinfo output of the public driver 15.5 posted in the thread "Drop in fglrx OpenCL performance: 14.12 vs 15.5"; there you can see the driver version is 1702.3. I'm not sure about the exact number, but the driver version of the latest Catalyst 15.9 is much higher (I guess 1800+). Could you please check the AMD Catalyst Control Centre and share your clinfo output?

Regards,


SDK 3.0 Beta was a bug-finding release, not a production release. SDK 3.0 had better be more stable than the Beta, but is it production-stable?

In the forum I have read about issues with using binaries (crashes with printf, issues with logs) and compiler problems, all in the past two weeks.

Of course, if I end up using OpenCL 2.0, since there is shared memory between host and GPU, what do I need Map/Unmap for?

Wouldn't that invalidate that whole section in the guide?
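
What I have in mind is roughly the sketch below (untested on my part, and assuming the device actually reports fine-grained buffer SVM; with coarse-grained SVM one would apparently still need clEnqueueSVMMap/clEnqueueSVMUnmap):

#include <CL/cl.h>

/* Untested sketch of the OpenCL 2.0 SVM idea above; ctx, queue and kernel
   are assumed to exist, and the device is assumed to support
   fine-grained buffer SVM. */
static void svm_sketch(cl_context ctx, cl_command_queue queue,
                       cl_kernel kernel, size_t n)
{
    /* Host and device share the allocation directly, so no
       clEnqueueMapBuffer/clEnqueueUnmapMemObject round trip is needed. */
    float *data = (float *)clSVMAlloc(ctx,
                                      CL_MEM_READ_WRITE |
                                      CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                      n * sizeof(float), 0);
    for (size_t i = 0; i < n; ++i)          /* write directly from the host */
        data[i] = (float)i;

    clSetKernelArgSVMPointer(kernel, 0, data);
    size_t gws = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(queue);                        /* results now visible to the host */

    clSVMFree(ctx, data);
}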

The Catalyst link you gave me is the same one I used for my download: fglrx-15.201.1151 is the latest Catalyst.

The discussion you linked uses clpeak, not clinfo. I don't know what clpeak is.

There may be an issue with clinfo reporting from SDK-2.9.1, but I'm using the latest catalyst.


Hi,

Please don't attribute Catalyst driver issues to the APP SDK. In fact, most of the issues you pointed out are related to the driver only. The APP SDK is only needed to build an OpenCL project, as it contains the required header files and libraries (and the CPU runtime). All the required GPU packages (e.g. the GPU compiler and GPU runtime) come with the driver. In most cases, performance and compiler related issues are directly related to the driver itself and have nothing to do with the APP SDK. For the same reason, the APP SDK is released only once in a while, whereas a new Catalyst driver ships almost every month. I hope that clarifies things.

> The discussion you linked uses clpeak, not clinfo. I don't know what clpeak is.
> There may be an issue with clinfo reporting from SDK-2.9.1, but I'm using the latest catalyst.

I wasn't referring to the clPeak project itself. Rather, I just wanted you to check the clinfo15.5.zip file, which contains the clinfo output of Catalyst 15.5. Sorry I didn't say that clearly enough.

Yes, it seems the clinfo link has somehow been broken and is pointing somewhere else. It should point to the clinfo that comes with the driver package.

> A) Read/Write Buffers

> WriteBuffer: 173 ms for 6241 calls, each @ 0.02773 ms

> ReadBuffer: 122 ms for 390 calls, each @ 0.312 ms

> B) Map/Unmap Buffers

> MapBuffer: 193 ms for 6630 calls, each @ 0.02907 ms

> UnmapBuffer: 120 ms for 6630 calls, each @ 0.01811 ms

> Notice that the total time for the Read/WriteBuffer calls is actually slightly less than the total for the Map/UnmapBuffer calls

Are you referring to the total running time? I think it is better to consider the per-call cost, not the overall one, because the number of calls was not the same in the two cases.

Now, coming to the BufferBandwidth sample: I ran it using a discrete GPU (dGPU) card. From my observations, I just want to highlight one case as an example.

(only relevant portions are shown here)

Case 1: Default, i.e. without any command-line options.

---------------------------------------------------------------------------

outputBuffer:        CL_MEM_WRITE_ONLY

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)        | 133.099

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS)      | 2.936

---------------------------------------|---------------

CPU read (GBPS)                        | 4.124

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)      | 462.137

-----------------------------------------------------------------------------------

Case 2: <BufferBandwidth> -of 5

outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR

3. GPU kernel write to outputBuffer

---------------------------------------|---------------

clEnqueueNDRangeKernel() (GBPS)        | 2.617

4. Host mapped read of outputBuffer

---------------------------------------|---------------

clEnqueueMapBuffer -- READ (GBPS)      | 683.713

---------------------------------------|---------------

CPU read (GBPS)                        | 3.926

---------------------------------------|---------------

clEnqueueUnmapMemObject() (GBPS)      | 266.306

[Note on setup: Windows 64-bit, APP SDK 3.0 beta, Hawaii XT card, one of the latest internal drivers]

As you can see, buffer mapping became much faster in the zero-copy case. However, kernel access bandwidth dropped a lot.

Now, in your case, I'm not sure about the exact usage of the zero-copy buffer. It would be helpful if you could provide a test case so that we can check it here. Please also provide your setup details and clinfo output.

Regards,

Hi Dipak,

Thanks for your time in this.

> Please don't attribute Catalyst driver issues to the APP SDK. In fact, most of the issues you pointed out are related to the driver only.

Good to know. I had assumed that the OpenCL compiler came with the SDK.

>> The discussion you linked uses clpeak, not clinfo. I don't know what clpeak is.

>> There may be an issue with clinfo reporting from SDK-2.9.1, but I'm using the latest catalyst.

> I wasn't referring to the clPeak project itself. Rather, I just wanted you to check the clinfo15.5.zip file, which contains the clinfo output of Catalyst 15.5. Sorry I didn't say that clearly enough.

> Yes, it seems the clinfo link has somehow been broken and is pointing somewhere else. It should point to the clinfo that comes with the driver package.

Searching my system I could find 2 clinfos, both of them giving the same result:

/usr/local/bin/clinfo -> /opt/AMDAPPSDK-2.9-1/bin/x86_64/clinfo

/usr/bin/clinfo: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.16, stripped

/usr/bin/clinfo | grep Driver

Driver version: 1445.5 (VM)
Driver version: 1445.5 (sse2,avx,fma4)

The only link I see is the /usr/local/bin. How do I fix it?

>> A) Read/Write Buffers

>> WriteBuffer: 173 ms for 6241 calls, each @ 0.02773 ms

>> ReadBuffer: 122 ms for 390 calls, each @ 0.312 ms

>> B) Map/Unmap Buffers

>> MapBuffer: 193 ms for 6630 calls, each @ 0.02907 ms

>> UnmapBuffer: 120 ms for 6630 calls, each @ 0.01811 ms


>> Notice that the total time for the Read/WriteBuffer calls is actually slightly less than the total for the Map/UnmapBuffer calls

> Are you referring to the total running time? I think it is better to consider the per-call cost, not the overall one, because the number of calls was not the same in the two cases.

Well, actually I report all of them: total time, total calls, and average cost per call, as reported by CodeXL. Notice that I compare sums. This is because a map is needed both for writing the input and for reading the results. In fact, from (A), 6241 writes + 390 reads = 6631 calls, which matches the ~6630 maps/unmaps from (B). This is just to verify that these are the absolute minimum calls needed in each case. Therefore the bottom line is the sum of costs over all reads/writes versus over all maps/unmaps.
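
In other words, each transfer in the map path, whether an input write or a result read, costs exactly one map plus one unmap, roughly as in the sketch below (illustrative only, not my actual code; the helper and parameter names are placeholders):

#include <CL/cl.h>
#include <string.h>

/* Illustrative sketch of why the counts add up: every input write and every
   result read in the map path costs one map plus one unmap, so
   6241 writes + 390 reads = 6631 transfers ~ 6630 map/unmap pairs. */
static void one_transfer_pair(cl_command_queue queue,
                              cl_mem inBuf, const void *hostIn, size_t inSize,
                              cl_mem outBuf, void *hostOut, size_t outSize)
{
    /* input transfer: 1 map + 1 unmap */
    void *p = clEnqueueMapBuffer(queue, inBuf, CL_TRUE, CL_MAP_WRITE,
                                 0, inSize, 0, NULL, NULL, NULL);
    memcpy(p, hostIn, inSize);
    clEnqueueUnmapMemObject(queue, inBuf, p, 0, NULL, NULL);

    /* ... kernel enqueue omitted ... */

    /* result transfer: 1 map + 1 unmap */
    p = clEnqueueMapBuffer(queue, outBuf, CL_TRUE, CL_MAP_READ,
                           0, outSize, 0, NULL, NULL, NULL);
    memcpy(hostOut, p, outSize);
    clEnqueueUnmapMemObject(queue, outBuf, p, 0, NULL, NULL);
}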

> As you can see, buffer mapping became much faster in the zero-copy case. However, kernel access bandwidth dropped a lot.

Makes sense. Access to host memory is slow, and the kernel write in the zero-copy case is comparable to the MapBuffer (READ) of case (1). The time saved by zero-copy is the time the kernel needs in case (1) to write to local (device) memory.

> Now, in your case, I'm not sure about the exact usage of the zero-copy buffer. It would be helpful if you could provide a test case so that we can check it here. Please also provide your setup details and clinfo output.

Thanks for looking into this. My setup:

Ubuntu 14.04 x64. All hardware is detailed in file hw. Environment variables are in file env. clinfo output is in file clinfo.

The SDK is 2.9-1. For a test case, you can just run the BufferBandwidth sample.

I hope you can reproduce the problem just by running the BufferBandwidth sample with SDK 2.9-1 under *any* environment.

Let me know if you need anything else.


Thanks for providing the setup details.

> For a test case, you can just run the BufferBandwidth sample.

> I hope you can reproduce the problem just by running the BufferBandwidth sample with SDK 2.9-1 under *any* environment.

I think it would be better if you could point out the section(s) of the BufferBandwidth sample's output that you consider a possible bug. I'll check and report to the concerned team if needed.

Regards,


Hi,

Sorry for the late reply. It would help if the thread could notify/mark responses.

The first time I ran BufferBandwidth I didn't do it correctly. I assumed that for the Map/Unmap case it would supply its own flags (*HOST_PTR), so the results were similar for both cases. Now I have rerun it, setting the flags explicitly:

=> BufferBandwidth -type 0 -of 5 -if 5

[...]

1. Host mapped write to inputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- WRITE (GBPS) | 2798.690
---------------------------------------|---------------
memset() (GBPS) | 10.350
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 577.834


2. GPU kernel read of inputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 5.758

Verification Passed!


3. GPU kernel write to outputBuffer
---------------------------------------|---------------
clEnqueueNDRangeKernel() (GBPS) | 5.309


4. Host mapped read of outputBuffer
---------------------------------------|---------------
clEnqueueMapBuffer -- READ (GBPS) | 2196.737
---------------------------------------|---------------
CPU read (GBPS) | 4.291
---------------------------------------|---------------
clEnqueueUnmapMemObject() (GBPS) | 558.413

Verification Passed!


Passed!

Kernel access bandwidth is markedly reduced, but zero-copy works, as in your case.

The host memory bus is pretty slow, comparable to the PCIe interconnect speed.

Thanks for your time, and my apologies for all the trouble I put you through.

Unfortunately I cannot close the issue or mark your answer as "correct", since it is not a question.


Hi nibal,

Thanks...

> It would help if the thread could notify/mark responses.

Actually, it does send automatic notifications according to the user's preferences. Please make sure you've set your notification preferences (User icon -> Preferences) accordingly. If you still don't get any, please report it to us.

> Thanks for your time, and my apologies for all the trouble I put you through.

No problem... sharing knowledge and information is what this forum is all about.

You are always welcome... hope to see you soon again.

Regards,
