
jean-claude
Journeyman III

Double memory copy in CAL ? What about calCtxResCreate ?

Optimizing memory exchanges between CPU & GPU

Hi,

I have a basic question about explicit and implicit memory exchanges between the CPU and the GPU.

Here is an example:

CALresource  Mrem;
calResAllocRemote1D(&Mrem, &device, 1, 1024, CAL_FORMAT_INT_1, CAL_RESALLOC_CACHEABLE);


This should allocate a 1024-int buffer in the PCI memory zone, correct?
Moreover, we ask for this resource to be cacheable to get better CPU read/write performance.

Now consider:

CALuint ptchx = 0;
CALuint *pMem;

calResMap((CALvoid**)&pMem, &ptchx, Mrem, 0);
     for (int i=0; i<1024; i++) *pMem++ = i; // or whatever CPU treatment
calResUnmap(Mrem);


(0) Had Mrem been allocated as a local GPU resource, my understanding is that an intermediate PCI memory zone would be implicitly allocated at calResMap, and that this PCI zone would be implicitly copied to local GPU memory at calResUnmap. Right?

(1) Now, since Mrem here is a remote resource:
Is the write done directly in Mrem, or does it first go to an intermediate PCI memory zone that is then implicitly copied to Mrem at calResUnmap(Mrem)? If so, we face a hidden double copy...

(2) The documentation suggests that the double copy can be avoided through careful use of "calCtxResCreate", which doesn't appear to be accessible in the SDK.
Is "calCtxResCreate" a ghost API? How is it used?

Thanks

Jean-Claude

0 Likes
15 Replies
rahulgarg
Adept II

0) Correct.
1) Not a double copy. When you map a remote resource, AFAIK the pointer is returned immediately and no copy is done.
2) It's an extension, so you need to get it through calExtGetProc.

(Note : I am not from AMD)
0 Likes

Thanks Rahul for your inputs,

With respect to "calCtxResCreate", I can find it neither in table B1 of AMD's Stream Computing document nor in the cal_ext.h header file.

Actually the only extensions that are mentioned are:

CAL_EXT_D3D9, CAL_EXT_OPENGL, CAL_EXT_D3D10 and CAL_EXT_COUNTERS...

This is what led me to think of calCtxResCreate as a ghost API!

Maybe some folks from AMD can clarify the matter.

Jean-Claude

0 Likes

Could you make sure you are using the 1.3beta header files? I can see the CAL_EXT_RES_CREATE extension id there.

0 Likes

I did some file cleanup on my PC... something I should have done a long time ago!

OK, now I have found CAL_EXT_RES_CREATE in cal_ext.h.

I'll certainly have a look when I have some spare time. BTW, are there a few lines of documentation on the proper use of calCtxResCreate?

0 Likes
rahulgarg
Adept II

You are probably looking for calResCreate2D in cal_ext.h
0 Likes

Yes, that's it.

OK, here is a quick attempt to check whether I have understood properly:

CALresult r;

// First check if extension supported
r = calExtSupported(CAL_EXT_RES_CREATE);
if (r != CAL_RESULT_OK) return false;           // too bad it is not!!

// Get pointer to the calResCreate2D extension function
CALextproc calResCreate2D_proc;
r = calExtGetProc(&calResCreate2D_proc, CAL_EXT_RES_CREATE, "calResCreate2D");

// Now create 2D resource in system memory
CALresource XMem_Res=0;
float *p_buffer;
(calResCreate2D_proc)(&XMem_Res, device, &p_buffer, 64, 256, CAL_FORMAT_FLOAT_4, size_bytes, 0);

Here I have a question: should size_bytes be 64*256*(4*4), i.e. w*h*sizeof(float4)? If so, why is this parameter needed?

Or is there some consideration for an optimal pitch?


// Then for instance init DMA transfer to this resource from a local GPU resource
CALmem XMem_Mem;
calCtxGetMem(&XMem_Mem,context,XMem_Res);

CALevent e;
r = calMemCopy(&e,context,local_Mem_M,XMem_Mem,0);
...

// Data should have been transferred from GPU local memory to the system memory buffer p_buffer

Right?

0 Likes

// Now create 2D resource in system memory
CALresource XMem_Res=0;
float *p_buffer;
(calResCreate2D_proc)(&XMem_Res, device, &p_buffer, 64, 256, CAL_FORMAT_FLOAT_4, size_bytes, 0);

Some problems in the code:

p_buffer has to be allocated before use. Allocation requirements:

1. The number of elements in width should satisfy the pitch alignment requirement: it should be an integer multiple of CALdeviceattribs.pitch_alignment (64).

2. p_buffer should be memory-aligned to CALdeviceattribs.surface_alignment bytes (256 bytes).

========

You are right, size_bytes is unnecessary. It has to match w*h*sizeof(format).

========

Regarding DMA - Yes, you should expect p_buffer to be updated with new data.
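
The allocation requirements listed above can be sketched in plain C. This is only an illustration under stated assumptions, not CAL code: the helper names round_up, pinned_size_bytes and alloc_pinned_candidate are hypothetical, the values 64 and 256 are the ones quoted in this reply (a real program would read pitch_alignment and surface_alignment from CALdeviceattribs), and a POSIX system is assumed for posix_memalign.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_memalign on strict compilers */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round n up to the next multiple of `multiple` (e.g. pad a width to the pitch alignment). */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple != 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Size argument expected by the create call: w * h * bytes per element. */
size_t pinned_size_bytes(size_t width, size_t height, size_t elem_bytes)
{
    return width * height * elem_bytes;
}

/* Allocate a host buffer that satisfies both requirements.
   Returns NULL if width is not a multiple of pitch_alignment
   or if the aligned allocation fails. The caller frees with free(). */
void *alloc_pinned_candidate(size_t width, size_t height, size_t elem_bytes,
                             size_t pitch_alignment, size_t surface_alignment)
{
    void *p = NULL;

    if (pitch_alignment == 0 || width % pitch_alignment != 0)
        return NULL;
    if (posix_memalign(&p, surface_alignment,
                       pinned_size_bytes(width, height, elem_bytes)) != 0)
        return NULL;
    return p;
}
```

For the 64x256 FLOAT_4 case from the question, alloc_pinned_candidate(64, 256, 16, 64, 256) would give a 256-byte-aligned buffer of 64*256*16 bytes; that pointer itself (not its address) is what would then be handed to the create call.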

0 Likes

Thanks Gaurav,

This clarifies the matter.

I've put it on my to-do list, and I'll try to use it in the near future.

Have a nice day.

Jean-Claude

0 Likes

Are there any other requirements? I've found that I get CAL_RESULT_ERROR returned if the height is larger than some size (I have not yet determined exactly what, but it is less than 8192, my card's maximum allowable dimension size).

After getting a function pointer blah blah, I call

    err1 = calResCreate2D(&this->resource, this->device->dev,
      (CALvoid*)buffer, 640, 4096, type, 640 * 4096 * sizeof(float), 0);

This returns CAL_RESULT_OK and runs fine, but I really want a 640x8192 matrix. A height of 4097 also works, so 4096 isn't the limit. 640 is a multiple of 64 and buffer was allocated with 256-byte alignment using posix_memalign(). What's the issue here? type in this case is FLOAT1.



0 Likes

I don't have the exact number. But the amount of memory available for pinned resources is much smaller than what is allowed for local or remote resources. You could try allocating multiple 64*64 resources and see how many resources and how much memory you are able to allocate.

0 Likes
rahulgarg
Adept II

I am not very sure myself. There is no documentation about this anywhere, and no samples either. I am trying some experiments, and once I am clearer about what's happening, I will get back to you.
0 Likes

rick.weber,
The amount of memory is limited to either 16MB or 64MB depending on your operating system. This is a limit that the CAL team is working with the driver teams to increase.
0 Likes

Well, since 8192*640*4 = ~20MB, I think it's safe to assume the limit is 16MB on my OS. Thanks for your help! Is this pinned limit based on a single contiguous block of memory, or on all the memory you can have allocated at a single point in time (i.e., can I have more allocated so long as no single buffer is greater than 16MB)?

0 Likes

It's the total available memory that can be used.
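
The arithmetic in this exchange can be checked with a small C sketch. It assumes the 16MB and 64MB figures quoted above mean 16 MiB and 64 MiB and that the limit applies to the running total of pinned memory, as stated in this reply; the function names are hypothetical and nothing here calls CAL.

```c
#include <assert.h>
#include <stddef.h>

#define MIB ((size_t)1024 * 1024)

/* Bytes needed for a width x height buffer of elem_bytes-sized elements. */
size_t buffer_bytes(size_t width, size_t height, size_t elem_bytes)
{
    return width * height * elem_bytes;
}

/* Returns 1 if `request` more bytes still fit under the total pinned limit, else 0. */
int fits_pinned_budget(size_t already_pinned, size_t request, size_t limit)
{
    return already_pinned <= limit && request <= limit - already_pinned;
}

/* Largest height that fits in whatever is left of the budget. */
size_t max_height(size_t width, size_t elem_bytes,
                  size_t already_pinned, size_t limit)
{
    size_t row_bytes = width * elem_bytes;

    assert(row_bytes != 0);
    if (already_pinned >= limit)
        return 0;
    return (limit - already_pinned) / row_bytes;
}
```

With these assumptions, a 640x4096 FLOAT1 buffer is 10 MiB and fits under a 16 MiB budget, while 640x8192 is 20 MiB and does not. If nothing else were pinned, max_height(640, 4, 0, 16 * MIB) puts the ceiling at 6553 rows, though any driver overhead would lower the real figure.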

0 Likes

Ok, thanks!

0 Likes