I wrote a program to test whether it was true.
My program worked like this:
calResMap((CALvoid**)&fdata, &pitch, inputRes, 0);
for (int i = 0; i < LENGTH; ++i) {
    fdata[index[i]] = inputData[i];
}
The inputData[] array was initialized with random floats, and the index[] array determined the order in which the data were written to the GPU (sequentially or randomly). LENGTH was set to 300,000.
On the first run, it copied the array of floats to the GPU sequentially, i.e., index[i] = i.
On the second run, it copied the same data to the GPU in a random order, i.e., index was first passed through random_shuffle.
But both runs consumed nearly the same time. Does that mean CAL copies floats to the GPU one by one?
Can I make it copy a chunk of data to the GPU in a single operation, like a DMA transfer, in order to improve performance?
Many thanks.
inputRes is local to the CPU, and I tested the speed of copying the array to the GPU.
I called calResUnmap immediately after transferring.
Here's the code snippet:
clock_t start = clock();
calCtxGetMem(&inputMem, ctx, inputRes);
calResMap((CALvoid**)&fdata, &pitch, inputRes, 0);
for (int i = 0; i < LENGTH; ++i) {
    fdata[index[i]] = inputData[i];
}
calResUnmap(inputRes);
printf("Elapsed time: %ld\n", (long)(clock() - start));
The elapsed time grows linearly as length grows, which is expected.
Originally posted by: michael.chu@amd.com
Can you try timing the following:
- that section of code but with the loop commented out,
- that section of code but with the CAL calls commented out.
I'm curious to see what is dominating the time, the loop or the CAL calls.
Michael.
I timed the following code:
calResMap((void**)&dataPtr, &pitch, resLocal, 0);
memset(dataPtr, 0, pitch * sizeof(float) * 4);
calResUnmap(resLocal);
For a 1D resource of 8192 float4 elements, I get:
calResMap -> 155 us
memset -> 67 us
calResUnmap -> 85 us
For a 1D resource of 1 float4, I get:
calResMap -> 61 us
memset -> 0.8 us
calResUnmap -> 35 us
This is quite slow. What is taking so much time? And why does calResMap take twice as long as calResUnmap?