Some quick (hopefully) questions:
1) Where does streamRead() store the data on the video card (texture memory)?
2) Where does streamWrite() get the data from to write back to RAM?
3) Is there a good discussion of this somewhere?
---jski
Same questions, plus one related to memory block boundaries:
How does Brook/CAL cope with out-of-bounds accesses, i.e. for
instance whenever one addresses a pixel outside of the texture?
Let me just take a very simplistic example to illustrate:
kernel void test_average(out float4 filtered<>, float4 input[][]) {
    const float2 si = indexof(filtered);   // pixel position
    float4 val = 0.0f;                     // local accumulator
    float i, j;
    for (i = -2.0f; i <= 2.0f; i += 1.0f) {
        for (j = -2.0f; j <= 2.0f; j += 1.0f) {
            float2 index = {i, j};
            val += input[si + index];      // here: risk of out of bounds
        } // end j loop
    } // end i loop
    filtered = val / 25.0f;                // 5x5 average
}
So: what, for instance, is the behaviour when i = -2.0f and j = -2.0f?
Which memory positions are sampled?
Should the boundary check be performed within the kernel code? That obviously implies additional computation overhead...
BTW, I did run a scaled-up version of this for 1024*1024 image filtering with no in-kernel boundary check; apparently (and surprisingly!) the result is perfect...
So, all this brings us back to jski's question: where and how does streamRead() store the data on the video card?
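For what it's worth, many texture samplers resolve out-of-bounds gathers with clamp-to-edge addressing, which would explain a "perfect" result near the borders; whether Brook+/CAL actually clamps (or returns zero) is exactly the open question here. A CPU-side sketch of what a clamped 5x5 average would compute (plain C with my own helper names; an illustration, not the actual CAL behaviour):

```c
/* clamp_index and filtered_at are hypothetical helpers that mimic
 * clamp-to-edge texture addressing on the CPU.  If the GPU sampler
 * clamps like this, a border pixel still averages 25 valid samples. */
static int clamp_index(int i, int n)
{
    if (i < 0)     return 0;
    if (i > n - 1) return n - 1;
    return i;
}

/* 5x5 box filter with explicit clamping: the CPU analogue of the
 * test_average kernel above, for a w-by-h single-channel image. */
float filtered_at(const float *input, int w, int h, int x, int y)
{
    float sum = 0.0f;
    for (int j = -2; j <= 2; ++j)
        for (int i = -2; i <= 2; ++i)
            sum += input[clamp_index(y + j, h) * w + clamp_index(x + i, w)];
    return sum / 25.0f;
}
```

With clamping, a constant image stays constant even at the corners, which is consistent with the "perfect" 1024*1024 result; a zero-border policy would instead darken the edges.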
Could you expand upon (discuss further) these three memory domains and their importance in GPGPU programming?
---jski
"In a single naive program that doesn't handle memory transfers, like all of the samples, there is a double copy involved between host->pcie and pcie->graphics."
Could we get an example of, say, simple_matmult, optimized to eliminate unnecessary copies?
---jski
Originally posted by: MicahVillmow jski,
1) The host memory is the memory domain that can be directly mapped into what a normal program would use. This memory is only available to the user-space program; the GPU cannot access it. This is where all your data structures and program data reside in the normal run of a computing session.
2) The host PCIe memory is a section of main memory on your system that is set aside and mapped into the PCIe memory space. This memory is accessible from both the host program and the GPU and thus can be modified by both. Modification of this memory requires synchronization between the GPU and CPU, usually with the calCtxIsEventDone API call. In Brook+ this is handled for you.
3) The graphics memory is the GPU's version of host memory. It is only accessible by the GPU, not by the CPU. There are three ways to copy data to GPU memory: implicitly through calResMap/calResUnmap, explicitly via calCtxMemCopy, or via a custom copy shader that reads from PCIe memory and writes to GPU memory.
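Those three domains can be caricatured on the CPU as three separate buffers; the "double copy" mentioned earlier is then just two memcpys in a row. A toy model in plain C (my own names, none of this is CAL API):

```c
#include <string.h>

/* Toy model of the three memory domains described above.  The arrays
 * and function names are illustrative only, not CAL calls. */
enum { N = 1024 };
static float host_mem[N];   /* 1) normal pageable program memory       */
static float pcie_mem[N];   /* 2) staging area visible to CPU and GPU  */
static float gpu_mem[N];    /* 3) local graphics memory, GPU-only      */

/* Naive transfer: host -> PCIe staging, then PCIe -> GPU (two copies). */
void naive_stream_read(const float *src)
{
    memcpy(pcie_mem, src, sizeof pcie_mem);     /* copy 1: host -> pcie */
    memcpy(gpu_mem, pcie_mem, sizeof gpu_mem);  /* copy 2: pcie -> gpu  */
}

/* Optimized transfer: the application produces its data directly in the
 * mapped PCIe buffer, so only the pcie -> gpu copy remains. */
void optimized_stream_read(void)
{
    memcpy(gpu_mem, pcie_mem, sizeof gpu_mem);
}
```

An optimized simple_matmult would presumably follow the second pattern: fill the mapped PCIe-visible buffer directly instead of building the matrices in ordinary host memory first.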
Which of these memories are cached, and where does the local memory of the stream processors fit into this hierarchy? Do the stream processors have separate local memories, or is that part of the graphics memory? There seems to be a huge void of information in the programming guide.
Thanks in advance.
Where does a "stream" physically reside on the GPU? I have the following function:
void throughputUp()
{
    float* test1 = (float*)malloc(sizeof(test1[0])*8192*8192);
    float* test2 = (float*)malloc(sizeof(test2[0])*8192*8192);
    float* test3 = (float*)malloc(sizeof(test3[0])*8192*8192);
    float* test4 = (float*)malloc(sizeof(test4[0])*8192*8192);
    float* test5 = (float*)malloc(sizeof(test5[0])*8192*8192);
    float testGPU1<8192,8192>;
    float testGPU2<8192,8192>;
    float testGPU3<8192,8192>;
    float testGPU4<8192,8192>;
    float testGPU5<8192,8192>;
    streamRead(testGPU1,test1);
    streamRead(testGPU2,test2);
    streamRead(testGPU3,test3);
    streamRead(testGPU4,test4);
    streamRead(testGPU5,test5);
    // release the host buffers once the data is on the GPU
    free(test1); free(test2); free(test3); free(test4); free(test5);
}
This function causes a resource allocation exception on the 4th streamRead call. According to my calculations, the GPU should have enough storage for this much data. My GPU has 2GB of RAM (FireStream 9170). Each matrix is 256MB, and there are five of them (thus 1.25GB total). I also tried putting a single streamRead in a for loop, loading to the same location/stream; however, I was getting unreasonable bandwidth numbers (125GB/s). Is the Brook+ environment aware of the fact that I've already loaded these streams?
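For what it's worth, the arithmetic in the post does check out (assuming 4-byte floats and no padding), so the failure looks more like a per-resource or staging-heap limit than raw capacity; a trivial sanity check of the sizes (stream_bytes and total_bytes are hypothetical helpers of mine, not Brook+ API):

```c
#include <stddef.h>

/* Raw size arithmetic for the five 8192x8192 float streams above. */
size_t stream_bytes(size_t w, size_t h)
{
    return w * h * sizeof(float);   /* 8192*8192*4 = 256 MiB per stream */
}

size_t total_bytes(size_t n_streams, size_t w, size_t h)
{
    return n_streams * stream_bytes(w, h);   /* 5 streams = 1.25 GiB */
}
```

1.25 GiB is comfortably under the 2 GB on a FireStream 9170, which is why the per-allocation path (e.g. the PCIe-visible staging area each streamRead goes through) is the more plausible suspect.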