
corry
Adept III

Double Buffering...

So the subject line doesn't really capture the question, but it's a bit complicated; I've already done about as much detective work as I can on my own.

Background: I'm implementing double buffering to have the CPU do some post-processing of GPU-created data buffers. There are just some things the CPU is much better suited to doing. So I figured I'd double up all my buffers and use calCtxSetMem to do my actual "page flipping" (not really a single page...).
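In code, the flip itself boils down to something like the sketch below. This is just a minimal illustration, not my actual wrapper: DoubleBuffer and flip are made-up names, and the CAL context, kernel name handle, and resources are assumed to already be set up.

#include "cal.h"

/* Two copies of a flippable buffer; 'current' is the copy the GPU owns. */
typedef struct {
    CALresource res[2];
    CALmem      mem[2];
    int         current;
} DoubleBuffer;

/* Rebind the kernel parameter to the other copy, leaving the previous one
 * free for the CPU to map and post-process. */
static CALresult flip(CALcontext ctx, CALname name, DoubleBuffer* db)
{
    db->current ^= 1;
    return calCtxSetMem(ctx, name, db->mem[db->current]);
}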

So I implemented it, and all looked OK until I looked at the performance numbers. I get this weird alternating performance where, if I zero-index the runs of the kernel, the even ones run at the same speed as the non-double-buffered version did. The odd ones (pun intended? you decide) run at almost double the speed.

So I investigated further. The big difference between the two? Mapping the input data. The even ones take ~2.5 seconds to map 8 MB. The odd ones take ~1.6 seconds to map the same amount of data (I verified it's being told it's the same amount of data).
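For reference, those numbers come from wrapping the map call with QueryPerformanceCounter, roughly like this (timeMap is just an illustrative helper, and inputRes stands in for the UAV's backing resource):

#include <windows.h>
#include "cal.h"

/* Time just the calResMap of one resource, in seconds. */
static double timeMap(CALresource inputRes)
{
    LARGE_INTEGER freq, t0, t1;
    CALvoid* ptr   = NULL;
    CALuint  pitch = 0;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    if (calResMap(&ptr, &pitch, inputRes, 0) != CAL_RESULT_OK)
        return -1.0;                          /* map failed */
    QueryPerformanceCounter(&t1);

    calResUnmap(inputRes);                    /* only the map itself is timed */
    return (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}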

I even pulled up the structures I created for storing the resource parameter info in the debugger. They're identical apart from the resource numbers assigned by CAL.

So far I've checked the first 128 bytes of returned data, and it's all correct in both cases. So why does one of them take so much longer? Is what I'm doing for double buffering just not allowed, with unpredictable results? If it is allowed, is there some way to speed things up?

It did help me in one way: I hadn't realized this kernel was so very memory bound. I had thought of ways to reduce the memory transfers somewhat, but didn't expect 8 MB to make that much difference! (The total time to run the kernel, map the memory, and check the results is about 2.8 seconds, and apparently 2.6 of that is memory transfer...)

I'm at a complete loss here, beyond doing some sort of exhaustive check to ensure all results from the odd case come back correctly under all inputs and conditions. I'm hoping there's a logical explanation for the behavior and I can skip that.

0 Likes
5 Replies
corry
Adept III

Another half-hour of staring at the code, and I have the numbers matching up again, but still no answer... really, it's more like the plot thickens...

The one piece I was "missing": when creating my double buffer, I copied *ALL* parameters, including the constant buffer, but I never initialized the second constant buffer. There's no short circuit in the kernel based on the data in the cb, and as I said, using the QueryPerformanceCounter method, all the time was in mapping the input UAV parameter. So by not putting anything in the cb, it was short-circuiting the UAV map time? But my results were correct???

Yeah, like I said, no answer...I'll be reducing the memory footprint here asap though...

So, after just a little more fighting, I think I have an answer, but I'll wait for confirmation from AMD. calResMap will not run asynchronously even if the memory is not bound to anything. So I take it I *have* to use calMemCopy. However, that appears to copy only to a CAL resource, which I'd have to use calResMap on again. It wasn't smart enough before to know that the memory wasn't in use, so will calResMap this time understand that it's a CPU pointer, not bound to anything, and therefore let me map it before the kernel is finished? If so, that might work... something tells me I'm just going to try it and answer my own question...

0 Likes

OK, so just for future reference: I set up a host pointer and used calMemCopy as I said, and that does in fact allow the calResMap to take place.
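Roughly, the shape of it is below. This is a sketch rather than my actual code; the width/height/format arguments, the zero flags, and the copyOut name are all placeholders.

#include "cal.h"

/* DMA the buffer the kernel wrote into a host-side ("remote") resource that
 * is never bound to the kernel, then map that resource instead. */
static CALresult copyOut(CALcontext ctx, CALdevice dev,
                         CALmem gpuMem,           /* buffer the kernel wrote */
                         CALuint width, CALuint height,
                         CALresource* hostRes,    /* out: mappable staging resource */
                         CALevent* copyEvent)     /* out: copy-completion event */
{
    CALmem    hostMem;
    CALresult r;

    r = calResAllocRemote2D(hostRes, &dev, 1, width, height,
                            CAL_FORMAT_FLOAT_4, 0);
    if (r != CAL_RESULT_OK) return r;

    r = calCtxGetMem(&hostMem, ctx, *hostRes);
    if (r != CAL_RESULT_OK) return r;

    /* Queue the copy; it is only guaranteed finished once copyEvent reports done. */
    return calMemCopy(copyEvent, ctx, gpuMem, hostMem, 0);
}

Once calCtxIsEventDone(ctx, *copyEvent) comes back CAL_RESULT_OK, calResMap on *hostRes can map the data without touching the resource the kernel is bound to.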

Can I get this as a request for the next interface? Have the system be smart enough to see that a parameter isn't bound to a kernel and just do the async DMA transfer for me. I've gone ahead and hidden it in my CAL wrapper library (which actually makes working with CAL pretty easy in general; it's just all the debugging needed when adding things to it...). I don't much care if it's doing the same thing behind the scenes as what I'm doing here; I'd just prefer the process for double buffering to be more straightforward/simple/streamlined/etc. Currently the tasks I'm performing at the same time are pretty minimal, but having double buffering working opens us up to a whole realm of possibilities.

Hopefully all this helps someone else out there doing something similar... and hopefully, when the low-level FSAIL system comes out, this will be 100% irrelevant. Speaking of which, the last time I checked (a long time ago), it said CAL would be removed in SDK 2.7; does that mean we'll get FSAIL in SDK 2.7? Are we expecting it this month? Is there a beta out we can use to start porting things on a separate machine from the production environment? Should I just start another topic for that list of questions?

0 Likes

CAL won't be removed, as that would break applications already out there, but it is being deprecated and won't be updated anymore. OpenCL will also stop using CAL after SDK 2.7.

We do sometimes put out pre-release Catalyst versions, and those include the next iteration of our runtime/compiler, but we have not in the past put out beta versions of the SDK for public use.

Hope this helps.

0 Likes

So you can neither confirm nor deny that FSAIL will be in the next SDK release?

I'd hoped to have my suggestion noted as well... I'd also like to know a little more about the short-circuit behavior when no data is put in the constant buffer. That was odd; I'd have thought it would just process with whatever happened to be in there...

0 Likes
corry
Adept III

So, as it turns out, no, this isn't working either. The stall had simply moved to my other map call.

Upon further investigation, I'm sitting waiting on calMemCopy for as long as it takes the kernel to run. 2.8 seconds to transfer 8 MB of data? Doubtful. So what is the preferred way to get async copies? Generating the next buffer turns out to be fairly computationally complex, so asynchronous operation is pretty much required... I could hack around this and just make it a memcpy, but then that's two memcpys to handle one item. Is this a bug, or is there something special I need to do?
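For what it's worth, the shape I'm after is basically the loop below: kick everything off, then keep the CPU busy while polling the copy's event instead of blocking in a map. Whether calMemCopy actually overlaps with the kernel like this is exactly the open question, and calCtxFlush plus the doHostWork callback are my assumptions, not anything confirmed here.

#include "cal.h"

/* Poll the copy's event while doing other host work, instead of blocking. */
static void overlapCopyWithHostWork(CALcontext ctx, CALevent copyEvent,
                                    void (*doHostWork)(void))
{
    calCtxFlush(ctx);   /* make sure the queued kernel + copy are submitted */

    /* CAL_RESULT_PENDING while the copy (and whatever it depends on) is
     * still in flight; CAL_RESULT_OK once it has finished. */
    while (calCtxIsEventDone(ctx, copyEvent) == CAL_RESULT_PENDING) {
        doHostWork();   /* e.g. post-process the previously flipped buffer */
    }
}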

0 Likes