
jski
Journeyman III

GPU memory architecture?

Some quick (hopefully) questions:

1) Where does streamRead() store the data on the video card (texture memory)?

2) Where does streamWrite() get the data from to write back to RAM?

3) Is there a good discussion of this somewhere?

---jski

jean-claude
Journeyman III

Same questions, plus one related to memory block boundaries:

How does Brook/CAL cope with out-of-bounds accesses, i.e., whenever one points to a pixel outside of the texture?

Let me just take a very simple example to illustrate (a 5x5 box filter):

kernel void test_average(out float4 filtered<>, float4 input[][]) {
    const float2 si = indexof(filtered);   // position of the output pixel
    float4 val = 0.0f;                     // running sum for the local average
    float i, j;

    for (i = -2.0f; i <= 2.0f; i += 1.0f) {         // i loop over rows
        for (j = -2.0f; j <= 2.0f; j += 1.0f) {     // j loop over columns
            float2 index = {i, j};
            val += input[si + index];      // here: risk of out-of-bounds read
        }
    }

    filtered = val / 25.0f;
}

So: what, for instance, is the behaviour when i = -2.0f and j = -2.0f?

Which memory positions are sampled?

Should the boundary check be performed within the kernel code? In that case, obviously, additional computation overhead is implied...
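For illustration, here is a minimal sketch of what such an in-kernel clamp could look like. The width and height parameters are hypothetical additions (the image dimensions would have to be passed in as extra kernel constants), and it assumes the Brook+ kernel language's min/max intrinsics are available:

kernel void test_average_clamped(out float4 filtered<>, float4 input[][],
                                 float width, float height) {
    const float2 si = indexof(filtered);
    float4 val = 0.0f;
    float i, j;

    for (i = -2.0f; i <= 2.0f; i += 1.0f) {
        for (j = -2.0f; j <= 2.0f; j += 1.0f) {
            float2 index;
            // clamp the sample position to the valid texture range
            index.x = min(max(si.x + i, 0.0f), width - 1.0f);
            index.y = min(max(si.y + j, 0.0f), height - 1.0f);
            val += input[index];
        }
    }

    filtered = val / 25.0f;
}

This reproduces clamp-to-edge behaviour in software, at the cost of two min/max pairs per sample.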

BTW, I did run a scaled-up version of this for 1024x1024 image filtering with no in-kernel boundary check, and apparently (and surprisingly!) the result is perfect...

So all this brings us back to jski's question: where and how does streamRead() store the data on the video card?




A streamRead operation can be mapped to the following CAL function calls:

1. calResMap of the memory resource backing your stream,
2. a memcpy of the data from your data pointer to the stream resource,
3. calResUnmap of the memory resource,
4. calCtxMemCopy of the memory to the graphics device (this would only be required if the resource was allocated as remote).

The streamWrite operation would do the opposite.

When using the CAL backend in Brook+, all reads that go out of bounds are clamped by the hardware to the boundary pixels. When using the CPU backend, a buffer overrun occurs and the results are undefined.
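As a rough illustration, steps 1-3 above might look like this at the CAL level. This is a sketch of the pattern only, not Brook+'s actual runtime code; it assumes res is a CAL_FORMAT_FLOAT_1 resource, hostData is the user's float pointer, width/height are already known, and error checking is omitted:

CALvoid* ptr = NULL;
CALuint pitch = 0;   // row pitch in elements, chosen by the CAL runtime

calResMap(&ptr, &pitch, res, 0);              // 1. map the resource into CPU address space
for (CALuint row = 0; row < height; ++row)    // 2. copy row by row, honouring the pitch
    memcpy((float*)ptr + row * pitch,
           hostData + row * width,
           width * sizeof(float));
calResUnmap(res);                             // 3. release the mapping so the GPU can use it

The row-by-row copy matters because the mapped resource's pitch is generally larger than the logical width.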

Micah,

I think jski was asking where the data physically resides.

There are three memory domains that are important in developing GPGPU apps: host memory, host PCIe memory, and graphics memory. CAL resources, which are used by Brook+, can be located in two of the three. If the memory is created using the local version of the API function, calResAllocLocal*, then it is stored in the graphics card's RAM; if it is created with the remote version, calResAllocRemote*, then it is stored in PCIe memory.
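For concreteness, a minimal sketch of the two allocation paths (assuming an already-opened CALdevice handle and a 1024x1024 single-float texture; error handling omitted):

// Local: the resource lives in the GPU's on-board RAM
CALresource localRes = 0;
calResAllocLocal2D(&localRes, device, 1024, 1024, CAL_FORMAT_FLOAT_1, 0);

// Remote: the resource lives in host memory mapped into the PCIe
// aperture, so both the CPU and the GPU can touch it
CALresource remoteRes = 0;
calResAllocRemote2D(&remoteRes, &device, 1, 1024, 1024, CAL_FORMAT_FLOAT_1, 0);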

Could you expand upon (discuss further) these three memory domains:

  1. Host memory,
  2. Host PCIe memory, and
  3. Graphics memory

and their importance in GPGPU programming?

---jski


jski,
1) Host memory is the memory domain that a normal program uses directly. It is only available to the user-space program; the GPU cannot access it. This is where all your data structures and program data reside in the normal run of a computing session.
2) Host PCIe memory is a section of main memory on your system that is set aside and mapped into the PCIe memory space. This memory is accessible from both the host program and the GPU, and thus can be modified by both. Modifying it requires synchronization between the GPU and CPU, usually with the calCtxIsEventDone API call; in Brook+ this is handled for you.
3) Graphics memory is the GPU's version of host memory. It is only accessible by the GPU, not by the CPU. There are three ways to copy data into GPU memory: implicitly through calResMap/calResUnmap, explicitly via calCtxMemCopy, or via a custom copy shader that reads from PCIe memory and writes to GPU memory.
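For illustration, the explicit copy path together with the synchronization just mentioned might look roughly like this (a sketch only; ctx is an open CAL context, and remoteMem/localMem are CALmem handles previously obtained with calCtxGetMem):

CALevent ev = 0;
calCtxMemCopy(&ev, ctx, remoteMem, localMem, 0);           // queue the PCIe -> GPU DMA
calCtxFlush(ctx);                                          // make sure the command is submitted
while (calCtxIsEventDone(ctx, ev) == CAL_RESULT_PENDING)   // spin until the copy completes
    ;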

The major difference between these three domains is the amount of copying involved. In a simple, naive program that doesn't manage memory transfers, like all of the samples, there is a double copy involved: host->PCIe and then PCIe->graphics. This is why you see a huge performance difference between the System GFlops and the Kernel GFlops numbers. With proper memory transfer management and the use of system pinned memory, via calCtxResCreate in cal_ext.h, the host->PCIe copy can be removed. However, it is not a very easy API call to use and comes with a lot of caveats.

The reason this matters is simply the memory bandwidth that can be achieved. Host->PCIe copies usually run in the hundreds of MB/s, PCIe->graphics memory copies are in the GB/s range, and on-chip memory bandwidth is in the tens to hundreds of GB/s. In GPGPU programming, you want to drastically reduce these copy bottlenecks, which can be done by pipelining execution and copies or some other novel technique.
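As an illustration of the pipelining idea, here is a double-buffered sketch at the CAL level: while the kernel works on one buffer, the DMA engine fills the other. This is a pattern sketch under assumptions (chunked input already staged in remoteMem[2]/localMem[2], an open ctx, and a compiled kernel func with its domain), not tested code:

CALevent copyEv[2] = {0, 0}, runEv[2] = {0, 0};

for (int i = 0; i < numChunks; ++i) {
    int buf = i % 2;

    // make sure the kernel that last used this buffer has finished
    while (calCtxIsEventDone(ctx, runEv[buf]) == CAL_RESULT_PENDING)
        ;

    // upload chunk i; this DMA can overlap the kernel still running
    // on the other buffer
    calCtxMemCopy(&copyEv[buf], ctx, remoteMem[buf], localMem[buf], 0);
    while (calCtxIsEventDone(ctx, copyEv[buf]) == CAL_RESULT_PENDING)
        ;

    // (real code would also rebind localMem[buf] to the kernel's
    // input with calCtxSetMem before launching)
    calCtxRunProgram(&runEv[buf], ctx, func, &domain);  // launch on this buffer
    calCtxFlush(ctx);                                   // submit without blocking
}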

"In a single naive program that doesn't handle memory transfers, like all of the samples, there is a double copy involved between host->pcie and pcie->graphics."

Could we get an example of, say, simple_matmult optimized to eliminate the unnecessary copies?

---jski




Which of these memories are cached and where does the local memory of the stream processors fit into this hierarchy? Do the stream processors have separate local memories or is this part of the graphics memory? There seems to be a huge void of information in the programming guide.

Thanks in advance.


Jski,
I'll pass on your request.

Where does a "stream" physically reside on the GPU? I have the following function:

void throughputUp()

{

float* test1 = (float*)malloc(sizeof(test1[0])*8192*8192);

float* test2 = (float*)malloc(sizeof(test2[0])*8192*8192);

float* test3 = (float*)malloc(sizeof(test3[0])*8192*8192);

float* test4 = (float*)malloc(sizeof(test4[0])*8192*8192);

float* test5 = (float*)malloc(sizeof(test5[0])*8192*8192);

float testGPU1<8192,8192>;

float testGPU2<8192,8192>;

float testGPU3<8192,8192>;

float testGPU4<8192,8192>;

float testGPU5<8192,8192>;

streamRead(testGPU1,test1);

streamRead(testGPU2,test2);

streamRead(testGPU3,test3);

streamRead(testGPU4,test4);

streamRead(testGPU5,test5);

}



This function causes a resource allocation exception on the 4th streamRead call. According to my calculations, the GPU should have enough storage for this much data. My GPU has 2GB ram (Firestream 9170). Each matrix is 256MB in size, and there are five of them (thus, 1.25GB total). Also, I tried putting a single streamRead in a for loop loading to the same location/stream. However, I was getting unreasonable bandwidth numbers (125GB/s). Is the Brook+ environment aware to the fact that I've already loaded these streams?
