Archives Discussions

paulv · ‎10-17-2011

I'm doing a project about overlapping data transfer with computation on a CPU/GPU architecture. I have figured out how to do it on GPUs that have their own separate global memory. However, I'm not sure whether a similar approach can be taken to speed up AMD Fusion architectures. As far as I can tell there is no separate global memory to speak of, so I'm guessing it falls outside the scope of my project, but I'd like to see a clear document that describes whether this is relevant for my project or not. Does anyone know where to find this?

Also there are supposed to be GPUs that are integrated directly into the motherboard. What are those called and do those have separate global memory?

corry · ‎10-17-2011

Look at the FSA roadmap, as that's where I derive most of what I "know" about it... Currently, from my best understanding of the docs, the CPU and GPU share RAM, yes, but not cache or a virtual address space. The GPU still sits on an on-die PCIe link and, as such, is treated much like current discrete GPUs. The difference is that system RAM speed can now affect GPU performance.

That said, and again from my understanding of where things are now, you still do things the GPU way, i.e. using memcpy will not produce defined results with GPU RAM. I would suppose this is for a few reasons. First, the GPU isn't operating in your virtual address space, so its physical RAM pages need to be mapped to your program's virtual pages. Second, the GPU would have to be notified that changes have been made and, in the other direction, the GPU would have to be able to notify the CPU that RAM has changed. Currently, there is no mechanism in place to do so. FSA is no trivial effort! It'll be worth it though!

paulv · ‎10-17-2011

Ok so if I understand you correctly they both use the same physical RAM, but always use disjoint portions of it. When you do a memcopy, data is transferred RAM-to-RAM (from the CPU's physical address space to the GPU's or vice versa).

So knowing that, can you overlap data transfer and kernel execution like you can with separate GPU cards (including things like page-locking memory and DMA engines)?

corry · ‎10-17-2011

Not quite... but let me answer the last question first. Yes, every technique used with a discrete GPU is still valid today. The APU isn't fully integrated yet, so currently it's more like a discrete GPU than an integrated one. Just remember that over the next couple of years that's all hopefully going to change, so keep an eye out for it... I personally can't wait!
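To make the benefit of overlapping concrete, here is a minimal timing sketch in Python. It is a toy model, not OpenCL code: the function names and the stage times are made up for illustration, and it assumes the upload, the kernel, and the download can each work on a different chunk at the same time (as with a DMA engine and page-locked buffers on a discrete card), with enough buffers that uploads never have to wait. It ignores memory bandwidth contention, which may matter on an APU.

```python
def serial_time(n_chunks, t_in, t_kernel, t_out):
    """Total time when each chunk is uploaded, processed and downloaded in turn."""
    return n_chunks * (t_in + t_kernel + t_out)


def overlapped_time(n_chunks, t_in, t_kernel, t_out):
    """Total time for a three-stage pipeline (upload, kernel, download).

    Assumption: each stage handles one chunk at a time, and the three
    stages can run concurrently on different chunks.
    """
    in_done = kernel_done = out_done = 0.0
    for _ in range(n_chunks):
        # Uploads run back to back.
        in_done = in_done + t_in
        # A kernel needs its input and a free compute unit.
        kernel_done = max(kernel_done, in_done) + t_kernel
        # A download needs the result and a free transfer engine.
        out_done = max(out_done, kernel_done) + t_out
    return out_done


if __name__ == "__main__":
    print("serial:    ", serial_time(4, 2.0, 3.0, 1.0))
    print("overlapped:", overlapped_time(4, 2.0, 3.0, 1.0))
```

With four chunks and stage times of 2, 3 and 1 units, the serial schedule takes 24 units and the pipelined one 15: once the pipeline is full, the total is dominated by the slowest stage.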

Now for a quick, slightly more in-depth look at paging and virtual address spaces, which should help to understand the problem better.

Your program uses paging and a virtual address space: page tables, cached in a Translation Lookaside Buffer (TLB), map physical pages of RAM into that virtual address space. Thus, your program thinks it has the entire machine to itself, can't see any other program's data, and is generally happy that way. The same goes for all the other programs on your computer; they all think they have total control over everything. Truth is, they don't. They don't even know where the RAM at any given address actually is. It might not even be RAM, it may be paged out to disk; the program just doesn't know. See the difficulty? The GPU portion of the processor isn't inside your program, nor is your program inside it (though your OpenCL kernel is). So one cannot speak to the other without some translator sitting in between.
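The translation described above can be sketched as a toy page table in Python. Everything here is hypothetical and simplified: the page size, the `translate` helper and the two example tables are invented for illustration, and a real MMU does this in hardware with multi-level tables and a TLB.

```python
PAGE_SIZE = 4096  # bytes per page; a common size, chosen here for illustration


class PageFault(Exception):
    """Raised when a virtual page has no resident physical frame."""


def translate(page_table, virtual_address):
    """Map a virtual address to a physical one using a per-process page table.

    page_table maps virtual page number -> physical frame number, or None
    when the page is currently paged out to disk.
    """
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table.get(page_number)
    if frame is None:
        raise PageFault(f"page {page_number} is not resident")
    return frame * PAGE_SIZE + offset


# Two hypothetical processes use the same virtual address but different frames.
process_a = {0: 7, 1: 3}
process_b = {0: 2, 1: None}  # page 1 is paged out

if __name__ == "__main__":
    print(translate(process_a, 0x10))
    print(translate(process_b, 0x10))
```

The same virtual address resolves to different physical locations in the two processes, which is why a device that sits outside a process's address space cannot simply be handed a raw pointer.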

That's where the OpenCL memory functions come in. They speak both to your program and to the video driver, so the two can agree on what to call address 0xXXXXXXXX. The CPU program can tell the GPU that it has changed the contents of the RAM, and the GPU can tell the driver to tell your program that it has changed the contents of that RAM. Because there is no hardware acceleration of this (no TLB for the GPU address space, and vice versa), this is slow.

If any of those terms aren't review (paging, virtual address space, TLB, etc.), jump on Wikipedia; there are some good articles for all of them... I know we had to work out how it was all done back in my college days in a class that covered operating systems and computer architecture... fun class honestly! Except when we covered multithreaded programming with BACI, aka C--... worst programming "language" at the time, ever. I still remember I had a bug where if I did if (!someInt) it would fail no matter what; I had to do if (someInt==0) to get it to work... took many hours to find! Ahh... the memories 🙂

paulv · ‎10-18-2011

Originally posted by: corry
Not quite... but let me answer the last question first. Yes, every technique used with a discrete GPU is still valid today. The APU isn't fully integrated yet, so currently it's more like a discrete GPU than an integrated one. Just remember that over the next couple of years that's all hopefully going to change, so keep an eye out for it... I personally can't wait!

[...]
That's where the OpenCL memory functions come in. They speak both to your program and to the video driver, so the two can agree on what to call address 0xXXXXXXXX. The CPU program can tell the GPU that it has changed the contents of the RAM, and the GPU can tell the driver to tell your program that it has changed the contents of that RAM. Because there is no hardware acceleration of this (no TLB for the GPU address space, and vice versa), this is slow.

I'm not quite sure what you are saying. In the first part you basically say that everything works the same from a programmer's perspective. But in the second part you seem to describe a way to synchronize the CPU and GPU in such a way that they can use the same physical address, which is very different from copying data from one place to another. I would say that even if that was done using the same functions that would have implications on performance.

I'm familiar with paging etc. by the way.

LeeHowes · ‎10-18-2011

You *can* do exactly the same thing with an APU. The problem is the limited memory bandwidth, and whether doing those copies will just occupy too much of it. The major optimisation with APUs is to not do any transfers at all, which we term zero copy in the optimisation guide.

If you set up a buffer correctly with ALLOC_HOST_PTR (check the APP SDK programming guide's optimisation chapter, which goes through this), you can allocate a buffer in CPU space that the GPU uses without copies. As the caches are non-coherent, and to keep within the OpenCL spec, you do theoretically have to do map and unmap operations, but they are very fast, doing just a cache flush or, under some circumstances, page table updates (I think). In practice it works without these and really is zero copy.
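The difference between the copy path and the zero-copy path can be illustrated with a plain Python analogy. This is not OpenCL: `kernel_increment` is a made-up stand-in for a GPU kernel, the "device" is just another Python object, and nothing here models cache flushes or map/unmap. In real OpenCL host code the corresponding pieces would be the `CL_MEM_ALLOC_HOST_PTR` flag on `clCreateBuffer` together with `clEnqueueMapBuffer` and `clEnqueueUnmapMemObject`.

```python
def kernel_increment(buf):
    """Stand-in for a GPU kernel: adds 1 to every byte of a writable buffer."""
    for i in range(len(buf)):
        buf[i] = (buf[i] + 1) % 256


host = bytearray([1, 2, 3, 4])

# Copy path: the "device" works on its own copy, so two transfers are needed.
device_copy = bytearray(host)   # host -> device copy
kernel_increment(device_copy)
untouched = bytes(host)         # host data is unchanged until copied back
host[:] = device_copy           # device -> host copy

# Zero-copy path: the "device" gets a view onto the very same bytes.
shared = memoryview(host)       # no data is copied here
kernel_increment(shared)        # results are visible in host immediately

if __name__ == "__main__":
    print(list(untouched), list(host))
```

In the copy path the host sees nothing until the result is transferred back; in the zero-copy path both sides touch the same storage, so the only remaining cost is making sure each side sees up-to-date data.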

The downside is that you lose GPU or CPU performance by doing it on current APUs, because the way the buses work doesn't give fully cache-coherent access on fast accesses, and when it is snooping in one direction to share correctly with the CPU caches, performance is lower. This is one of the things you can expect to see improve drastically over the next couple of years.

corry · ‎10-18-2011

Originally posted by: LeeHowes
As the caches are non-coherent and to keep within the OpenCL spec you do theoretically have to do map and unmap operations, but they are very fast doing just a cache flush or, under some circumstances, page table updates (I think). In practice it works without these and really is zero copy.

Interesting, I had figured this was possible, but didn't realize it would actually do it.

To answer what I was saying before: for your program to be able to copy, either the device's memory has to be mapped into your program's virtual address space, or the vendor has to provide a method to simply upload to the device. In OpenCL/CAL, they chose to map the device memory into the program's virtual address space (or at least to provide a buffer that they will upload to or download from the device; I'm not really sure, since I believe PCIe can map device memory to CPU-addressable addresses, which could then be mapped into a program's virtual address space, but that's really irrelevant 🙂).

Anyhow, I guess what's being said is that you still have to follow the API, just as if you were copying to a discrete card, but the system is smart enough to just use the same addresses and not actually copy a buffer from the host to the device... Pretty cool. Like I have been saying, APUs are exciting, so much so that it's the first thing I've been excited about in the CPU world since Intel originally announced Merced, which through a comedy of errors, firings, employees leaving, HP, and other mistakes became Itanium. Back in what, 1998 or so, when they announced/leaked info, it certainly sounded exciting... But I digress yet again 🙂

genaganna · ‎10-25-2011

Originally posted by: paulv
I'm doing a project about overlapping data transfer with computation on a CPU/GPU architecture. I have figured out how to do it on GPUs that have their own separate global memory. However, I'm not sure whether a similar approach can be taken to speed up AMD Fusion architectures. As far as I can tell there is no separate global memory to speak of, so I'm guessing it falls outside the scope of my project, but I'd like to see a clear document that describes whether this is relevant for my project or not. Does anyone know where to find this?

Also there are supposed to be GPUs that are integrated directly into the motherboard. What are those called and do those have separate global memory?

There are two ways of overlapping data transfer with computation:

1. Implicit method (Use Zero Copy buffer)

2. Explicit method (Use async method)

Both methods work on APUs as well as on external (discrete) GPUs.

Both are implemented in the MonteCarloAsian sample shipped with the SDK.

Data copying with AMD Fusion