jean-claude
Journeyman III

Some questions on CAL optimisation

Hi guys,

Hope you had a very nice Xmas day.

*wine*


Here is a batch of basic questions that call for clarification. Thanks for your hints.

Jean-Claude


Performance and sync considerations for CAL kernels
(all of the following assumes operating on a single context)

(1) On the use of command queue flush :
------------------------------------------------------
Assume several kernels are to be executed one after the other.
What is the tradeoff between:
- issuing a flush after each execution call in the command queue
- or issuing a single flush after all execution calls have been issued?


(2) On the use of several sequential kernels :
------------------------------------------------------------
For kernels to be executed sequentially, what is the tradeoff between:
- calling each kernel separately
- or merging them into a single kernel?


(3) On the use of multi-output kernels :
-----------------------------------------------------
Which is more efficient in terms of performance?

kernel K_1 (out float4 C<>, float4 A<>, float4 B<>);
kernel K_2 (out float4 D<>, float4 A<>, float4 B<>);

or

kernel K_3 (out float4 C<>, out float4 D<>, float4 A<>, float4 B<>);


(4) On kernel execution :
-----------------------------------
Is it safe to assume that:
(1) the order of execution of the kernels will be the same as the order of
execution calls in the command queue?

(2) no kernels are run concurrently? Consider:
kernel K_1 (out float4 C<>, float4 A<>, float4 B<>);
kernel K_2 (out float4 D<>, float4 A<>, float4 B<>);
kernel K_3 (out float4 E<>, float4 C<>, float4 B<>);

For instance, can K_1 and K_2, or K_2 and K_3, run concurrently, while
K_1 and K_3 are expected to run sequentially (K_3 reads C, which K_1 writes)?
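
The data dependencies in question (4) can be checked mechanically. The following is a minimal sketch, not CAL code: `KernelDesc`, `reads` and `must_serialize` are hypothetical names introduced only for illustration. It only tells which pairs of kernels are independent of each other; whether the hardware or the runtime ever overlaps them is a separate matter.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_INPUTS 4

/* Hypothetical description of a kernel's I/O: one output stream and
   up to MAX_INPUTS input streams, identified by name. */
typedef struct {
    const char *name;
    const char *output;
    const char *inputs[MAX_INPUTS];
    int num_inputs;
} KernelDesc;

/* True if kernel k reads the stream called `stream`. */
bool reads(const KernelDesc *k, const char *stream)
{
    assert(k->num_inputs <= MAX_INPUTS);
    for (int i = 0; i < k->num_inputs; ++i)
        if (strcmp(k->inputs[i], stream) == 0)
            return true;
    return false;
}

/* True if `second` must wait for `first` because of a data hazard:
   read-after-write, write-after-read, or write-after-write. */
bool must_serialize(const KernelDesc *first, const KernelDesc *second)
{
    bool raw = reads(second, first->output);
    bool war = reads(first, second->output);
    bool waw = strcmp(first->output, second->output) == 0;
    return raw || war || waw;
}

/* The three kernels from question (4). */
const KernelDesc K_1 = { "K_1", "C", { "A", "B" }, 2 };
const KernelDesc K_2 = { "K_2", "D", { "A", "B" }, 2 };
const KernelDesc K_3 = { "K_3", "E", { "C", "B" }, 2 };
```

With these descriptions, `must_serialize(&K_1, &K_3)` is true because K_3 reads C, while the pairs (K_1, K_2) and (K_2, K_3) share only inputs and have no hazard.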


(5) On the binding of kernel I/Os, i.e. input, output, constant :
--------------------------------------------------------------------------------
What is the cost of calCtxSetMem(ctx, inname, inputmem)?

Is it safe to assume that the I/O bindings for a kernel
are kept alive in the context, or do they have to be issued
each time the kernel is executed?


(6) On calMemCopy :
-----------------------------
CALresult calMemCopy(CALevent* event, CALcontext ctx,
CALmem srcMem, CALmem dstMem, CALuint flags);

What are the valid values for the flags parameter?

jean-claude
Journeyman III

Additional question:

(7) Concurrent work in memory while a kernel is running

Is it possible for the CPU to issue a calResMap and work on a memory resource C while a kernel is active on different memory resources (say A and B) and calCtxIsEventDone still returns CAL_RESULT_PENDING?

Again, the kernel here operates on resources different from C.

Thanks

Hi Jean-Claude,


1. A command queue flush works on a specific context. All the commands associated with a context are kept in a queue to avoid CPU-GPU transfer overhead each time a new command is invoked. I am not sure how effective this technique is with CAL, though.


2. Performance on GPUs depends on the ratio of memory fetches to ALU operations. Merging multiple kernels into a single kernel usually improves ALU usage, reduces memory fetches and increases memory reuse.


3. Same answer as for 2. With K_3, you reduce memory fetches (A and B are read once instead of twice) and increase the ALU work done per fetch.


4. The order of execution will be the same as the calling order. GPUs can't run multiple kernels concurrently.


5. They are kept alive. No need to bind them again.


6. The flags parameter is currently not used. Pass 0.


7. I haven't tested it, but it should be possible.
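
Points 2 and 3 above can be illustrated with a small CPU-side model that counts input reads. This is only a sketch under assumptions: the thread does not say what K_1, K_2 and K_3 compute, so the add and multiply bodies are made up, and `fetch`, `run_separate` and `run_fused` are hypothetical helpers, not CAL or Brook+ calls. Real gains depend on caching and on how memory-bound the kernels are.

```c
#include <assert.h>
#include <stddef.h>

/* Running count of input-element reads, standing in for GPU memory fetches. */
unsigned long fetch_count = 0;

/* Read one input element and record the fetch. */
float fetch(const float *stream, size_t i)
{
    ++fetch_count;
    return stream[i];
}

/* Two separate passes, like K_1 followed by K_2:
   every element of A and B is read twice. */
void run_separate(const float *a, const float *b, float *c, float *d, size_t n)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n; ++i)   /* K_1: C = A + B (hypothetical body) */
        c[i] = fetch(a, i) + fetch(b, i);
    for (size_t i = 0; i < n; ++i)   /* K_2: D = A * B (hypothetical body) */
        d[i] = fetch(a, i) * fetch(b, i);
}

/* One fused pass, like K_3: every element of A and B is read once
   and reused for both outputs. */
void run_fused(const float *a, const float *b, float *c, float *d, size_t n)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n; ++i) {
        float x = fetch(a, i);
        float y = fetch(b, i);
        c[i] = x + y;
        d[i] = x * y;
    }
}
```

For n elements, the separate version performs 4n input reads and the fused version 2n, while both produce the same C and D.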


Hope this answers some of your questions.

Thanks Gaurav,

On point (1), command queue flush: I've seen a substantial improvement in performance simply by not issuing a flush after each command, but instead packing commands together and flushing later on.
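
The effect described here can be reasoned about with a very simple cost model. This is a sketch under stated assumptions, not a measurement: it assumes each flush pays a fixed submission overhead and each command a fixed execution cost, in arbitrary units. `CostModel` and `total_cost` are hypothetical names, not part of CAL.

```c
#include <assert.h>

/* Hypothetical cost model, in arbitrary time units: each flush pays a fixed
   submission overhead, each command a fixed execution cost. */
typedef struct {
    double flush_overhead;
    double command_cost;
} CostModel;

/* Total cost of issuing num_commands, flushing once every batch_size commands
   (a final partial batch also gets its own flush). */
double total_cost(const CostModel *m, unsigned num_commands, unsigned batch_size)
{
    assert(batch_size > 0);
    unsigned flushes = (num_commands + batch_size - 1) / batch_size;
    return flushes * m->flush_overhead + num_commands * m->command_cost;
}
```

With an overhead of 10 and a command cost of 1, 100 commands cost 1100 units when flushed one by one and 110 units when flushed once. The model ignores one thing that can favour earlier flushes: the GPU cannot start on commands that are still sitting unflushed in the queue.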

This raises another question: at the time the flush is issued, I suppose the command queue is transferred from the system to the GPU, and that there is a local command queue associated with each context.

So the question is: what is the size limit of such a (local) command queue?

The reason I'm asking is that I'm issuing very large command lists (because of loop processing), and a possible overflow might explain a run-time bug I'm experiencing.

Jean-Claude

It is 64k.

64k bytes or 64k commands?

64K commands
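
Given that limit, one way to keep a long loop of commands from overfilling the queue is to flush before the pending count would exceed it. The sketch below is only a model of that bookkeeping. It assumes that "64K" means 65536 and that the limit applies to commands queued but not yet flushed, neither of which the thread spells out. `Queue`, `enqueue`, `flush_queue` and `submit_all` are hypothetical stand-ins; a real program would make the corresponding CAL calls at those points. Whether exceeding the limit is what causes the run-time bug mentioned above is not confirmed here.

```c
#include <assert.h>

/* Assumption: "64K commands" means 65536 commands queued but not yet flushed. */
#define QUEUE_LIMIT 65536u

/* Hypothetical model of a per-context command queue. */
typedef struct {
    unsigned pending;   /* commands queued since the last flush */
    unsigned flushes;   /* number of flushes issued so far */
} Queue;

/* Stand-in for flushing the queue; does nothing if the queue is empty. */
void flush_queue(Queue *q)
{
    if (q->pending > 0) {
        q->pending = 0;
        ++q->flushes;
    }
}

/* Stand-in for queueing one kernel run; flushes first if the queue is full. */
void enqueue(Queue *q)
{
    if (q->pending == QUEUE_LIMIT)
        flush_queue(q);
    ++q->pending;
    assert(q->pending <= QUEUE_LIMIT);
}

/* Issue `total` commands, flush whatever remains, and return the flush count. */
unsigned submit_all(Queue *q, unsigned total)
{
    for (unsigned i = 0; i < total; ++i)
        enqueue(q);
    flush_queue(q);
    return q->flushes;
}
```

In this model, 200000 commands are submitted with 4 flushes: three full batches of 65536 and one final partial batch.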
