Archives Discussions

pawdzied · ‎02-04-2015

Hi,

I'm working on a paper about signal processing on the GPU. My preliminary research suggests that when the input vector/matrix arrives at about 100 MB/s, it becomes unprofitable to run the calculations on a GPU, even though the algorithm parallelizes well. The problem is the bandwidth of the PCI-E interface and the need to copy data from CPU memory to GPU memory. As far as I understand, if I build my system around an AMD APU with the HSA architecture, I should be able to avoid this bottleneck and get back to roughly 10x the CPU performance for my application.

Could you please tell me how well this will work? If it is that simple, I guess it will be much faster to run this program on a Kaveri APU than on a high-end R9 GPU. Am I correct?

Thanks for all replies.

Best Regards,

Pawel

jason · ‎02-05-2015

PCIe gen3 x16 gives you roughly 16 GB/s in each direction (about 32 GB/s counting both directions), theoretical. I have processed 120 MiB/s and the download/upload time is insignificant if you're only doing it once; I would suspect a sub-millisecond transfer time on your configuration. Then it's all about how long it takes to process each element, which, depending on GPU and matrix dimensions, I'm guessing is anywhere from sub-millisecond up to maybe 10 ms. I would suspect that under most configurations it is still more profitable to process on the GPU than on the CPU with this many numbers, even with elementary operations; the GPU may leave room for more throughput and have a lower latency to completion. For a modestly recent GPU and simple signal processing, I would expect a well-made kernel to execute 100-1000x faster than a beefy Intel CPU can muster.
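To put rough numbers on the link budget, here is a minimal sketch of the arithmetic. The 16e9 bytes/s figure is an assumed theoretical one-direction peak for PCIe gen3 x16, the 4 MiB chunk is a hypothetical batch size, and the function names are made up for illustration. Real transfers add per-call overhead that this ignores.

```python
# Rough link-budget arithmetic; every figure here is an assumption, not a measurement.
PCIE_GEN3_X16_BYTES_PER_S = 16e9  # assumed theoretical peak, one direction
SENSOR_BYTES_PER_S = 100e6        # input rate from the original question


def transfer_ms(n_bytes: float, link_bytes_per_s: float) -> float:
    """Ideal time in ms to move n_bytes over the link, ignoring per-transfer overhead."""
    return 1e3 * n_bytes / link_bytes_per_s


def link_utilization(stream_bytes_per_s: float, link_bytes_per_s: float) -> float:
    """Fraction of one link direction consumed by a steady stream."""
    return stream_bytes_per_s / link_bytes_per_s


if __name__ == "__main__":
    chunk = 4 * 2**20  # hypothetical 4 MiB batch
    print(f"4 MiB chunk: {transfer_ms(chunk, PCIE_GEN3_X16_BYTES_PER_S):.3f} ms (ideal)")
    util = link_utilization(SENSOR_BYTES_PER_S, PCIE_GEN3_X16_BYTES_PER_S)
    print(f"link utilization at 100 MB/s: {util:.2%}")
```

Under these assumptions the stream occupies well under 1% of the link, so if a GPU version is slow, the raw bandwidth is unlikely to be the only cause; measured per-transfer overhead matters more.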

Re Kaveri: it has only 8 CUs and, worst of all, while it has direct access to system memory, that memory is much slower (roughly 1/3 to 1/5 the bandwidth, or less) than the GDDR5 on a discrete GPU. This can easily become a bottleneck. If future generations add a memory region offering 150 GiB/s or more, however, it would be very appealing for some applications to skip the GPU card and use the on-die GPU instead, avoiding the PCIe bottleneck.

pawdzied · ‎02-05-2015

Right... But this will be a real-time application. It doesn't do any complex operations, but it has to deal with a large amount of input data from sensors (arriving at about 100 MB/s). With our data it has been tested (not by me; one of the professors did it) and found to run "not faster" on the GPU when the input data arrives that fast, even though the kernel itself ran very quickly.

We have to do something like this in a loop:
GPU-oriented program:
get data from input sensors -> prepare data for parallel computation -> copy data from DDR3 GP memory to GDDR5 device memory over PCI-E -> do calculations (very little time, in fact, when using the GPU) -> copy from device memory to DDR3 -> merge data
CPU-oriented program (multithreaded):
get data from input sensors -> prepare data for parallel computation -> compute (takes much longer) -> merge data
HSA program:
get data from input sensors -> prepare data for parallel computation -> compute (longer than discrete GPU, but faster than CPU) [GPU part] -> merge data

In the GPU-oriented program we still have to read from slow DDR3 memory in order to copy data into GDDR5 memory, and then read it again from GDDR5 memory to perform the calculations. I guess with HSA we can skip the "save to GDDR5, read from GDDR5, save to GDDR5, read from GDDR5" part and leave only the "read from DDR3, save to DDR3" part, which has to be done anyway. Am I right about this?
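One way to compare the three loops is to sum per-chunk stage times and check them against the time the sensors need to produce one chunk. The sketch below does only that. Every stage time is a hypothetical placeholder chosen to exercise the code, not a measurement, and says nothing about which pipeline actually wins; `Pipeline`, `chunk_period_ms`, and `keeps_up` are illustrative names. It also assumes stages run back to back with no overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    name: str
    stages_ms: dict = field(default_factory=dict)  # stage name -> ms per chunk

    def total_ms(self) -> float:
        """Per-chunk latency, assuming the stages run one after another."""
        return sum(self.stages_ms.values())


def chunk_period_ms(chunk_bytes: int, input_bytes_per_s: float) -> float:
    """Time the sensors need to produce one chunk: the real-time budget."""
    return 1e3 * chunk_bytes / input_bytes_per_s


def keeps_up(p: Pipeline, budget_ms: float) -> bool:
    """True if one chunk is finished before the next one is complete."""
    return p.total_ms() <= budget_ms


# HYPOTHETICAL placeholder timings; replace with your own measurements.
PIPELINES = [
    Pipeline("discrete GPU", {"prepare": 1.0, "upload": 1.0, "kernel": 0.5,
                              "download": 1.5, "merge": 1.0}),
    Pipeline("CPU", {"prepare": 1.0, "compute": 20.0, "merge": 1.0}),
    Pipeline("HSA/APU", {"prepare": 1.0, "kernel": 2.0, "merge": 1.0}),
]

if __name__ == "__main__":
    budget = chunk_period_ms(4 * 2**20, 100e6)  # assumed 4 MiB chunks at 100 MB/s
    for p in PIPELINES:
        print(f"{p.name}: {p.total_ms():.1f} ms of {budget:.1f} ms budget, "
              f"keeps up: {keeps_up(p, budget)}")
```

Plugging in measured stage times shows whether the copies over PCI-E are really the dominant term, or whether the prepare and merge steps (which all three variants share) cost more.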

jason · ‎02-05-2015

Well, it also matters how much data you submit at a time and what transfer mechanisms you use. With the data rate I mentioned, I process data from multi-megapixel cameras at 30-100 Hz (also a real-time application). If you are trying to submit tiny pieces of data very often, there's going to be overhead for that, and you must decide what an appropriate latency (chunk of samples) to process is. For reference, in my own benchmarks I can upload 4 MiB to laptop and desktop GPUs in under a millisecond. Download tends to take 2x as long, so about 1.5 ms for the same chunk size. On modern Intel processors, I can barely touch every pixel in that time frame. I suggest you do some additional analysis for your given algorithm to see where it lands and what the uploads and downloads cost, as well as their granularity with respect to your sensor.
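The chunk-size trade-off can be sketched with a simple model: each transfer costs a fixed overhead plus size divided by bandwidth. The 20 microsecond overhead and 6e9 bytes/s sustained bandwidth below are assumed placeholder values, not benchmark results, and the function names are illustrative.

```python
def transfer_time_s(chunk_bytes: float, overhead_s: float, bw_bytes_per_s: float) -> float:
    """Modelled time for one transfer: fixed overhead plus size over bandwidth."""
    return overhead_s + chunk_bytes / bw_bytes_per_s


def effective_bandwidth(chunk_bytes: float, overhead_s: float, bw_bytes_per_s: float) -> float:
    """Bytes per second actually achieved once the fixed overhead is included."""
    return chunk_bytes / transfer_time_s(chunk_bytes, overhead_s, bw_bytes_per_s)


def fill_time_s(chunk_bytes: float, input_bytes_per_s: float) -> float:
    """Latency added by waiting for the sensor to fill one chunk."""
    return chunk_bytes / input_bytes_per_s


if __name__ == "__main__":
    OVERHEAD_S = 20e-6  # assumed per-transfer overhead (placeholder)
    BANDWIDTH = 6e9     # assumed sustained bytes/s (placeholder)
    RATE = 100e6        # sensor rate from the thread
    for size in (4 * 2**10, 64 * 2**10, 2**20, 4 * 2**20):
        eff = effective_bandwidth(size, OVERHEAD_S, BANDWIDTH)
        print(f"{size:>8} B: {eff / 1e9:5.2f} GB/s effective, "
              f"{1e3 * fill_time_s(size, RATE):6.2f} ms to fill")
```

In this model small chunks are dominated by the fixed overhead, while large chunks approach the full bandwidth but add fill latency, which is the balance to settle against the sensor's granularity.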

Anyway, if your operations are indeed that simple (must be some straight MACs or something), AVX is probably the better fit in this instance. HSA/APUs would also work well provided their latencies are acceptable, but they are probably overkill unless the LDS removes a memory bottleneck from your simple operations. There is definitely a niche they can fill, though, when they deliver finished computations with lower latency than GPUs.

HSA memory access improvement in signal processing applications