Archives Discussions

ryta1203 · ‎01-27-2010

I know this has probably been mentioned before but please refresh my memory:

If I have a (streaming) kernel that has 4 inputs and 1 output, how do I address the global memory in CAL/IL?

For inputs with size 1024 is it:

g[Tid.x+0], g[Tid.x+1], g[Tid.x+2], g[Tid.x+3]

OR

g[Tid.x+0], g[Tid.x+1024], g[Tid.x+2048], g[Tid.x+3072]

???

And for the output, what is it? If I use g[Tid.x+0] won't that overwrite my input value??

hazeman · ‎01-28-2010

You can attach the g register to only one input/output buffer (an image in CAL terminology; it can be both input and output at the same time).

So it's your decision where in this buffer you want to place your data.

You can interleave the data, so g[Tid.x+0], g[Tid.x+1], g[Tid.x+2], g[Tid.x+3] is OK. But you can also put the data from the first buffer, then from the second, and so on (so g[Tid.x+0], g[Tid.x+1024], g[Tid.x+2048], g[Tid.x+3072] will be correct).

And yes, g[Tid.x+0] will overwrite data from the first buffer. You can reserve some part of the buffer for output data (and offset into it).

On the Cypress family you can use UAVs to access multiple buffers (so there is no need to struggle with putting all the data into one buffer). You could also use TUs to read data from buffers; they have the advantage of a cache, which could speed things up.

PS. To be clear g[] indexing starts from 0 - so g[0] gives first float4 ( or int4 or uint4 ) from attached buffer.
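
To make the two layouts concrete, here is a minimal host-side sketch in Python. It only models the index arithmetic; it is not CAL or IL code. The names `interleaved_index`, `blocked_index` and `pack` are made up for illustration, and each list element stands for one 128-bit g[] slot. It also assumes that in the interleaved case each thread's base is Tid.x * 4 rather than plain Tid.x, which the post above leaves implicit; without that scaling, neighbouring threads would touch overlapping slots.

```python
# Hypothetical model of the single buffer bound to g[].
# One list element stands for one 128-bit slot (float4 / int4 / uint4).
N = 1024          # elements per input, as in the question
NUM_INPUTS = 4


def interleaved_index(tid, k, num_inputs=NUM_INPUTS):
    """Slot of element `tid` of input `k` when the inputs are interleaved.

    Assumption: the per-thread base is tid * num_inputs.
    """
    return tid * num_inputs + k


def blocked_index(tid, k, n=N):
    """Slot of element `tid` of input `k` when the inputs are stored back to back."""
    return k * n + tid


def pack(inputs, index_fn):
    """Copy several equally sized inputs into one flat buffer."""
    buf = [None] * (len(inputs) * len(inputs[0]))
    for k, data in enumerate(inputs):
        for tid, value in enumerate(data):
            buf[index_fn(tid, k)] = value
    return buf


# Tag every value with (input number, element number) so lookups are easy to check.
inputs = [[(k, tid) for tid in range(N)] for k in range(NUM_INPUTS)]
interleaved = pack(inputs, interleaved_index)
blocked = pack(inputs, blocked_index)
```

In the blocked layout, thread `tid` finds its four inputs at `tid`, `tid + 1024`, `tid + 2048` and `tid + 3072`; in the interleaved layout they sit in four consecutive slots starting at `tid * 4`.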

ryta1203 · ‎01-28-2010

Can anyone from AMD confirm this is correct?

hazeman · ‎01-28-2010

Originally posted by: ryta1203 Can anyone from AMD confirm this is correct?

You really don't have to wait for ATI to confirm. That information is available in the IL docs. You can also write a small test kernel (you will have to write quite a few of them anyway, as I had to, because the CAL docs are sometimes not clear).

ryta1203 · ‎01-28-2010

I will look at the docs again; however, this thread seems to suggest otherwise:

http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=116692&highlight_key=y&keyword1=gl...

Micah has said that for 8 outputs the offset is +0, +1, +2, etc...

This implies that for each output allocated in global memory the offset between elements in each output is the number of outputs. It's possible, though, that Micah didn't understand what I was asking.

Also, I would think this would be the same for inputs.

MicahVillmow · ‎01-28-2010

ryta,
The global buffer is a uniform address space, so you can only bind a single resource/memory to it. How you lay out your input and output data is kernel-specific. If you want bursting on global, you need to add 0, 1, 2, 3, ... to your base offset into the global.
I.e.:
mov g[1024 + 0], r0
mov g[1024 + 1], r1
mov g[1024 + 2], r2
mov g[1024 + 3], r3

Would get you bursting.

Also, since the g register is both an input and output register and it is a uniform address space, writing to the wrong location can clobber your input data.
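
One way to avoid clobbering inputs is to plan the regions of the single buffer up front and check that the output region does not overlap any input. The sketch below is plain Python that models only the offsets, in g[] slots; `plan_layout` and `overlaps` are hypothetical helper names, not part of CAL.

```python
def plan_layout(input_sizes, output_size):
    """Lay the inputs end to end, then place the output after the last input.

    Returns (input_bases, output_base, total_slots), all measured in g[] slots.
    """
    bases = []
    offset = 0
    for size in input_sizes:
        bases.append(offset)
        offset += size
    output_base = offset
    return bases, output_base, offset + output_size


def overlaps(base_a, size_a, base_b, size_b):
    """True if the half-open ranges [base, base + size) share at least one slot."""
    return base_a < base_b + size_b and base_b < base_a + size_a


# Four inputs of 1024 slots each and one output of 1024 slots.
bases, out_base, total = plan_layout([1024] * 4, 1024)
```

With this layout the kernel would read input k at `bases[k] + Tid.x` and write its result at `out_base + Tid.x`, so a write can never land on an input slot.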

ryta1203 · ‎01-29-2010

So if you have a kernel with 8 inputs then the developer must do the combining of these inputs into one "uniform address space"? This seems like not a good idea to me. Also, the developer must keep track of the specific address of where the output begins? Again, this seems like not a good idea.

Also, OpenCL obviously has to "handle" this in some way. Does it do some address translation/combining, and how much overhead is associated with that?

Also, so when you copy over 4 inputs, how is that handled? If I have 4 inputs of 1k size on the CPU side and want to copy to the gpu must I manually compress them into one array and then copy that array to global memory and then manually address each input using the compression offset?? Again, this does not seem like a good idea.

MicahVillmow · ‎01-29-2010

OpenCL currently uses a single UAV and maps all global and emulated pointers onto the same memory surface using a combination of cal api calls and copy shaders.

ryta1203 · ‎01-29-2010

Micah,

So what is the map layout? If you have two inputs and you want to access the first element of each input would it be:

g[idx.x+0], g[idx.x+1] or would it be g[idx.x+0], g[idx.x+size]?

MicahVillmow · ‎01-29-2010

g[idx.x] and g[idx.x + size_first_buffer]

ryta1203 · ‎01-29-2010

Micah,

Thank you, this is what I thought. So if you have multiple outputs and you want to write out to the indexed element of each of those outputs then it's not possible to burst write, correct? Since the output would be: g[idx.x+0], g[idx.x+size_first_buffer]?

ryta1203 · ‎01-31-2010

Originally posted by: MicahVillmow g[idx.x] and g[idx.x + size_first_buffer]

So is this considered coalesced (I know not normally, but I'm talking about how the memory hardware is laid out and how it's read)?

If not and OpenCL does it this way then why? Wouldn't it be better to interleave the input elements to achieve coalescing?

MicahVillmow · ‎02-01-2010

Input/output interleaving is app/kernel-specific and is not something that we can reliably generate in the compiler stack.

ryta1203 · ‎02-01-2010

Micah,

So OpenCL doesn't do burst reading/writing to multiple inputs/outputs? Last question about this, thanks.

MicahVillmow · ‎02-01-2010

Not between i/o pointers, but it is possible to trigger bursting to the same pointer.

ryta1203 · ‎02-01-2010

Micah,

So essentially, if I have 2 outputs, you are saying it's possible to burst write to multiple consecutive locations of one output but not to burst write into both outputs, with the way it's set up now, because you can't guarantee that the two outputs are set up for writes to consecutive locations?

MicahVillmow · ‎02-01-2010

Burst writing only happens to sequential addresses. Where the memory exists doesn't matter, as long as the addresses are sequential, then the writes can be bursted. OpenCL can't interleave pointers since pointers can be aliased and point to the exact same memory location. Hope this helps.
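
The "sequential addresses" condition can be illustrated with a small Python check. This models only the address arithmetic, not what the hardware or compiler actually does; `is_sequential` is a made-up helper, and the buffer size of 1024 is taken from the earlier example in the thread.

```python
def is_sequential(addresses):
    """True if each address is exactly one g[] slot after the previous one."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))


SIZE_FIRST_BUFFER = 1024   # assumed size of the first buffer, in g[] slots
idx = 10                   # example thread index

# Four writes to g[1024 + 0] .. g[1024 + 3], as in the mov example earlier.
same_pointer = [1024 + i for i in range(4)]

# One write per output buffer: g[idx] and g[idx + size_first_buffer].
across_pointers = [idx, idx + SIZE_FIRST_BUFFER]
```

Here `is_sequential(same_pointer)` holds, while `is_sequential(across_pointers)` does not, which matches the statement that bursting is possible within one pointer but not between separate i/o pointers laid out back to back.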

vic20 · ‎02-05-2010

The memory of my Radeon 5750 is not completely detected. Here is a part of the output of CLInfo:

...

Global memory size:               268435456
Constant buffer size:               65536
Max number of constant args:           8
Local memory type:               Scratchpad
Local memory size:               32768
....

Platform ID:                    0x7f28c5d9c890
Name:                        Juniper
Vendor:                    Advanced Micro Devices, Inc.
....

my config:

ubuntu 9.04 64 bits, AMD Athlon(tm) II X4 620 Processor

Is this normal?

thanks.

nou · ‎02-05-2010

Yes, this is normal.

Hey AMD, will a future release support more than 256 MB of memory?

vic20 · ‎02-05-2010

other question: is it possible, with linux, to use the IGP of the mainboard for display in order to allocate the GPU only for computations ?

Maybe it is too complicated at this stage of the driver development 😉

genaganna · ‎02-05-2010

Originally posted by: vic20 other question: is it possible, with linux, to use the IGP of the mainboard for display in order to allocate the GPU only for computations ?

Display should be connected to GPU.

GPU Global Memory