

ryta1203
Journeyman III

GPU Global Memory

I know this has probably been mentioned before but please refresh my memory:

If I have a kernel (streaming) that has 4 inputs and 1 output, how do I call the global memory in CAL/IL?

For inputs with size 1024 is it:

g[Tid.x+0], g[Tid.x+1], g[Tid.x+2], g[Tid.x+3]

OR

g[Tid.x+0], g[Tid.x+1024], g[Tid.x+2048], g[Tid.x+3072]

???

And for the output, what is it? If I use g[Tid.x+0] won't that overwrite my input value??

0 Likes
20 Replies
hazeman
Adept II

You can attach the g register to only one input/output buffer (an image in CAL terminology; it can be both input and output at the same time).

So it's your decision where in this buffer you want to place your data.

You can interleave the data, so g[Tid.x+0], g[Tid.x+1], g[Tid.x+2], g[Tid.x+3] is OK. But you can also put the data from the first buffer, then from the second, and so on (so g[Tid.x+0], g[Tid.x+1024], g[Tid.x+2048], g[Tid.x+3072] will be correct).

And yes, g[Tid.x+0] will overwrite data from the first buffer. You can reserve some part of the buffer for output data (and offset into it).

On the Cypress family you can use UAVs to access multiple buffers (so there is no need to struggle with putting all the data into one buffer). You could also use the TUs (texture units) to read data from buffers; they give you the advantage of a cache, which could speed things up.

PS. To be clear, g[] indexing starts from 0, so g[0] gives the first float4 (or int4 or uint4) from the attached buffer.
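
To make the two layouts concrete, here is a minimal host-side sketch in Python (not CAL/IL) of the index arithmetic. The function names are made up for illustration, indices are in g[] elements, and it assumes Tid.x is a plain, unscaled element index. Under that assumption the interleaved layout needs the base scaled by the number of inputs (Tid.x*4 + k); otherwise neighbouring threads would overlap.

```python
N = 1024         # elements per input, as in the question
NUM_INPUTS = 4   # number of inputs packed into the single g[] buffer

def interleaved_index(tid, k, num_inputs=NUM_INPUTS):
    """Index of element `tid` of input `k` when inputs are interleaved."""
    return tid * num_inputs + k

def blocked_index(tid, k, n=N):
    """Index of element `tid` of input `k` when inputs are stored back to back."""
    return k * n + tid

def output_index(tid, num_inputs=NUM_INPUTS, n=N):
    """Index of output element `tid` if the output region follows all inputs."""
    return num_inputs * n + tid

if __name__ == "__main__":
    tid = 5
    print([interleaved_index(tid, k) for k in range(NUM_INPUTS)])
    print([blocked_index(tid, k) for k in range(NUM_INPUTS)])
    print(output_index(tid))
```

Placing the output after the inputs (offset 4*1024 here) is just one possible choice; any region that does not overlap the inputs avoids the overwrite.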

 

0 Likes

Can anyone from AMD confirm this is correct?

0 Likes

Originally posted by: ryta1203 Can anyone from AMD confirm this is correct?

You really don't have to wait for ATI to confirm. That information is available in the IL docs. You can also write a small test kernel (you will have to write quite a few of them anyway, as I had to, because the CAL docs are sometimes unclear).

 

 

 

0 Likes

I will look at the docs again; however, this thread seems to suggest otherwise:

http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=116692&highlight_key=y&keyword1=gl...

Micah has said that for 8 outputs the offset is +0, +1, +2, etc...

This implies that, for each output allocated in global memory, the stride between consecutive elements of that output equals the number of outputs. It's possible, though, that Micah didn't understand what I was asking.

Also, I would think this would be the same for inputs.

 

0 Likes

ryta,
The global buffer is a uniform address space, so you can only bind a single resource/memory to it. How you lay out your input and output data is kernel specific. If you want bursting on global, you need to add 0, 1, 2, 3, ... to your base offset into the global.
For example:
mov g[1024 + 0], r0
mov g[1024 + 1], r1
mov g[1024 + 2], r2
mov g[1024 + 3], r3

Would get you bursting.

Also, since the g register is both an input and output register and it is a uniform address space, writing to the wrong location can clobber your input data.
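
As a rough illustration of the two points above (sequential addresses for bursting, and clobbering of inputs), here is a small Python sketch. It does not model the hardware; the helper names are hypothetical and it only checks the address conditions described in this post.

```python
def is_sequential(addresses):
    """True if each address is exactly one past the previous one."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))

def clobbers_input(write_addresses, input_start, input_len):
    """True if any write lands inside [input_start, input_start + input_len)."""
    return any(input_start <= a < input_start + input_len
               for a in write_addresses)

if __name__ == "__main__":
    burst = [1024 + i for i in range(4)]   # g[1024+0] .. g[1024+3]
    print(is_sequential(burst))            # candidate for bursting
    print(is_sequential([0, 1024]))        # strided, not sequential
    print(clobbers_input(burst, 0, 1024))  # inputs assumed at g[0..1023]
```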
0 Likes

So if you have a kernel with 8 inputs, then the developer must combine these inputs into one "uniform address space"? This does not seem like a good idea to me. Also, the developer must keep track of the specific address where the output begins? Again, this does not seem like a good idea.

Also, OpenCL obviously has to "handle" this in some way. Does it do some address translation/combining, and how much overhead is associated with that?

Also, when you copy over 4 inputs, how is that handled? If I have 4 inputs of size 1k on the CPU side and want to copy them to the GPU, must I manually pack them into one array, copy that array to global memory, and then manually address each input using its packing offset? Again, this does not seem like a good idea.

0 Likes

OpenCL currently uses a single UAV and maps all global and emulated pointers onto the same memory surface using a combination of cal api calls and copy shaders.
0 Likes

Micah,

  So what is the map layout? If you have two inputs and you want to access the first element of each input, would it be:

g[idx.x+0], g[idx.x+1] or would it be g[idx.x+0], g[idx.x+size]?

0 Likes

g[idx.x] and g[idx.x + size_first_buffer]
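
A minimal Python sketch of that back-to-back layout, for readers who want to see the bookkeeping: several host arrays are packed into one buffer and each gets a base offset. This is only an illustration of the addressing, not how the OpenCL runtime does it internally; `pack_buffers` is a made-up helper.

```python
def pack_buffers(buffers):
    """Concatenate buffers; return the packed list and each buffer's base offset."""
    packed = []
    offsets = []
    for buf in buffers:
        offsets.append(len(packed))  # this buffer starts where the previous ended
        packed.extend(buf)
    return packed, offsets

if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [10.0, 20.0, 30.0]
    g, offsets = pack_buffers([a, b])
    idx = 1
    # Equivalent of g[idx.x] and g[idx.x + size_first_buffer]
    print(g[idx + offsets[0]], g[idx + offsets[1]])
```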
0 Likes

Micah,

  Thank you, this is what I thought. So if you have multiple outputs and you want to write out to the indexed element of each of those outputs then it's not possible to burst write, correct? Since the output would be: g[idx.x+0], g[idx.x+size_first_buffer]?

0 Likes

Originally posted by: MicahVillmow g[idx.x] and g[idx.x + size_first_buffer]


So is this considered coalesced (I know not normally, but I'm talking about how the memory hardware is laid out and how it's read)?

If not and OpenCL does it this way then why? Wouldn't it be better to interleave the input elements to achieve coalescing?

0 Likes

Input/output interleaving is app/kernel specific and is not something that we can reliably generate in the compiler stack.
0 Likes

Micah,

 So OpenCL doesn't do burst reading/writing to multiple inputs/outputs? Last question about this, thanks.

0 Likes

Not between i/o pointers, but it is possible to trigger bursting to the same pointer.
0 Likes

Micah,

  So essentially, if I have 2 outputs, you are saying it's possible to burst write to multiple consecutive locations of one output, but not to burst write into both outputs the way it's set up now, because you can't guarantee that the two outputs are laid out for writes to consecutive locations?

0 Likes

Burst writing only happens to sequential addresses. Where the memory exists doesn't matter, as long as the addresses are sequential, then the writes can be bursted. OpenCL can't interleave pointers since pointers can be aliased and point to the exact same memory location. Hope this helps.
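
A toy Python model of why aliasing rules out automatic interleaving: if two pointers refer to the same memory, a write through one must be visible through the other. Giving each pointer its own interleaved slot would break that. The function names and the value 42 are purely illustrative assumptions.

```python
def shared_surface(memory, p, q, i):
    """Write through base p, read through base q on one shared surface."""
    memory[p + i] = 42
    return memory[q + i]

def interleaved_slots(num_elems, i):
    """Same kernel, but p and q were each given a private interleaved slot."""
    memory = [0] * (2 * num_elems)
    memory[2 * i + 0] = 42     # write lands in p's slot
    return memory[2 * i + 1]   # read comes from q's slot

if __name__ == "__main__":
    # p == q models two kernel arguments that alias the same buffer.
    print(shared_surface([0] * 8, 0, 0, 3))
    print(interleaved_slots(8, 3))
```

With aliased pointers the shared surface returns the value just written, while the interleaved version returns stale data, which is the kind of mismatch a compiler cannot rule out on its own.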
0 Likes

The memory of my Radeon 5750 is not completely detected. Here is a part of the output of CLInfo:

...

  Global memory size:           268435456
  Constant buffer size:         65536
  Max number of constant args:  8
  Local memory type:            Scratchpad
  Local memory size:            32768
...

  Platform ID:                  0x7f28c5d9c890
  Name:                         Juniper
  Vendor:                       Advanced Micro Devices, Inc.
...

My config:

Ubuntu 9.04 64-bit, AMD Athlon(tm) II X4 620 Processor

Is this normal?

Thanks.

0 Likes

Yes, this is normal.

Hey AMD, will a future release support more than 256 MB of memory?
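
For reference, the 256 MB figure is just the CLInfo "Global memory size" value above converted from bytes. A quick Python check (the variable names are only for illustration):

```python
global_mem_bytes = 268435456                  # "Global memory size" from CLInfo
global_mem_mib = global_mem_bytes // (1024 * 1024)

if __name__ == "__main__":
    print(global_mem_mib, "MiB")
```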

0 Likes

Another question: is it possible, on Linux, to use the mainboard's IGP for display so that the GPU is dedicated to computation?

Maybe that is too complicated at this stage of driver development 😉

0 Likes

Originally posted by: vic20 other question: is it possible, with linux, to use the IGP of the mainboard for display in order to allocate the GPU only for computations ?


The display should be connected to the GPU.

0 Likes