Archives Discussions

drstrip · ‎03-07-2010

I know that the OpenCL API allows me to determine whether a device provides (true) local memory, but I can't find a comparable query about private memory. Is it assumed that a device with true local memory always supports local private memory as well? Conversely, is it also true that a device without local memory never has local private memory?

Fr4nz · ‎03-07-2010

Originally posted by: drstrip I know that the OpenCL API allows me to determine whether a device provides (true) local memory, but I can't find a comparable query about private memory.

In fact, OpenCL offers no way to determine whether a device has real private memory or merely emulates it...

Is it assumed that a device with true local memory always supports local private memory as well? Conversely, is it also true that a device without local memory never has local private memory?

Private and local memory are two different things: the first represents the stack of a kernel; the second is a very fast (and small) memory to use when you manage small, heavily reused data or want to relieve the pressure on global memory when reading or writing many times.

At the moment, the current AMD OpenCL implementation maps private memory onto global memory. On the 5xxx series, global memory is cached through the texture caches, so there are benefits to using private variables anyway (obviously these variables mustn't be too big and should preferably be reused by work-items).

If I'm not wrong, AMD plans to map private memory onto SIMD engine registers in future OpenCL releases (IIRC we have 2kB per SIMD engine), but only AMD staff can answer your question precisely.

nou · ‎03-07-2010

What I know is that simple variables like float and int are mapped to registers, but private arrays are mapped to global memory. AMD developers are working on putting arrays in registers too.

Fr4nz · ‎03-07-2010

Originally posted by: nou What I know is that simple variables like float and int are mapped to registers, but private arrays are mapped to global memory. AMD developers are working on putting arrays in registers too.

Oh, simple variables are already mapped onto registers? Really?? Where did you learn that? If true, that's very nice!

Anyway, nice to hear that about arrays!

And what about the texture cache? Will it still be used in the future as a cache for global memory (VTEX), or will another type of cache be used?

drstrip · ‎03-07-2010

To me, this whole discussion points out a shortcoming of the current OpenCL spec. Efficient algorithms require knowledge of not just whether local and private memory are implemented differently from global, but also the relative access times. The Sobel filter example copies a submatrix to local memory for use in the work group. Since the local memory is just mapped back to global, these copies are completely wasted, resulting in slower execution. Even if there were true local memory, unless its access time is sufficiently faster than that of global memory, these copies remain wasted.
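The break-even argument above can be put into a rough cost model. This is only a sketch: the function name and the latency figures are made up for illustration, and it ignores caching, coalescing and bank conflicts.

```python
def local_copy_pays_off(reuse_count, t_global, t_local):
    """Return True if staging an element in local memory beats reading it
    from global memory every time.

    Costs are per element, in arbitrary time units:
      direct: each of the reuse_count reads goes to global memory
      staged: one global read plus one local write for the copy,
              then reuse_count reads from local memory
    """
    direct = reuse_count * t_global
    staged = t_global + t_local + reuse_count * t_local
    return staged < direct


# Hypothetical latencies: 400 units for global, 4 units for true local memory.
print(local_copy_pays_off(reuse_count=9, t_global=400, t_local=4))
# Local memory emulated in global memory: same latency for both.
print(local_copy_pays_off(reuse_count=9, t_global=400, t_local=400))
```

With emulated local memory the staged cost always exceeds the direct cost in this model, which is the "wasted copies" case described above.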

Fr4nz · ‎03-07-2010

Originally posted by: drstrip To me, this whole discussion points out a shortcoming of the current OpenCL spec. Efficient algorithms require knowledge of not just whether local and private memory are implemented differently from global, but also the relative access times. The Sobel filter example copies a submatrix to local memory for use in the work group. Since the local memory is just mapped back to global, these copies are completely wasted, resulting in slower execution. Even if there were true local memory, unless its access time is sufficiently faster than that of global memory, these copies remain wasted.

Well, on 5xxx video cards local memory is mapped onto real on-chip local memory. Only 4xxx cards emulate it in global memory.

Raistmer · ‎03-07-2010

Hm....
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 134217728
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 16384

Point of interest: the "Local memory type: Global" line. It seems there is a way to know whether local memory is emulated via global memory or not.
(Or maybe I misunderstood the word "Global" here? ...)
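The "Local memory type" field most likely reflects the CL_DEVICE_LOCAL_MEM_TYPE device query, which reports either CL_LOCAL or CL_GLOBAL. As a rough sketch, a dump like the one above could be checked as follows; the helper names are hypothetical and the parser assumes simple "Key: value" lines.

```python
def parse_device_dump(text):
    """Parse 'Key: value' lines into a dict with stripped keys and values."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            info[key.strip()] = value.strip()
    return info


def local_memory_is_emulated(info):
    """True when the dump reports local memory as backed by global memory."""
    return info.get("Local memory type", "").lower() == "global"


# The dump quoted in the post above.
DUMP = """\
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 134217728
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 16384
"""

info = parse_device_dump(DUMP)
print(local_memory_is_emulated(info))
```

This only answers the question for local memory; nothing in the dump describes private memory.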

drstrip · ‎03-07-2010

My question asks whether private memory is emulated. Look at the top post - I start by pointing out that we can determine this for local memory.

And the fact that we have local memory does not address the question of relative access time, at least as far as being able to determine this via the OpenCL API.

nou · ‎03-07-2010

Well, that is quite dependent on the implementation and even on the kernel code. The AMD implementation uses registers for single values of built-in types, but if you use a private array, for example int a[10];, then it is stored in global memory.

On NVIDIA cards, registers are used as private memory, but if you use too much private memory it begins to be stored in global memory too.

Fr4nz · ‎03-07-2010

Originally posted by: nou Well, that is quite dependent on the implementation and even on the kernel code. The AMD implementation uses registers for single values of built-in types, but if you use a private array, for example int a[10];, then it is stored in global memory.

On NVIDIA cards, registers are used as private memory, but if you use too much private memory it begins to be stored in global memory too.

By any chance, do you know the size of the register file on both ATI and NVIDIA? Maybe 2kB per SIMD engine (so 2kB per thread stack)?

nou · ‎03-07-2010

5870 : 256 x 64 x 128bit x 20SIMD = 5.24 MB

4870 : 256 x 64 x 128bit x 10SIMD = 2.62 MB

nvidia have smaller registers.

Fr4nz · ‎03-07-2010

Originally posted by: nou 5870 : 256 x 64 x 128bit x 20SIMD = 5.24 MB

4870 : 256 x 64 x 128bit x 10SIMD = 2.62 MB

nvidia have smaller registers.

Holy god, this is a LOT of private memory! Much more than local memory!

Moreover, registers are at least as fast as local memory, right? Let's hope that ATI soon implements private memory in registers for small arrays as well...

drstrip · ‎03-07-2010

Do these numbers break down like this?

256 - wave front size

64 - registers per thread

128 bit - register size.

If I have an int, does that take an entire register, or just 4 bytes worth?

n0thing · ‎03-08-2010

No it is like this -

256 - Registers per thread

64 - Wavefront-size

Each register is 128 bits wide, so an int takes up an entire register. It's therefore better to vectorize your algorithm so that it maps onto the underlying hardware.
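As a quick check of the figures quoted earlier in the thread, the arithmetic can be reproduced with this breakdown. The function names are illustrative only; the constants come straight from the posts.

```python
REGISTER_BITS = 128         # width of one general-purpose register
REGISTERS_PER_THREAD = 256  # registers per thread, per the breakdown above
WAVEFRONT_SIZE = 64         # work-items per wavefront


def amd_register_file_bytes(simd_count):
    """Total register file size in bytes for a device with simd_count SIMDs."""
    per_simd = REGISTERS_PER_THREAD * WAVEFRONT_SIZE * REGISTER_BITS // 8
    return per_simd * simd_count


def scalar_utilization(scalar_bits=32):
    """Fraction of a 128-bit register that a lone scalar actually uses."""
    return scalar_bits / REGISTER_BITS


print(amd_register_file_bytes(20))  # 5870: 5242880 bytes, about 5.24 MB
print(amd_register_file_bytes(10))  # 4870: 2621440 bytes, about 2.62 MB
print(scalar_utilization())         # 0.25 for a 32-bit int
```

A lone 32-bit int uses a quarter of its register, which is the motivation for packing values into int4 or float4.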

You can see the number of registers used by your kernel in the ISA, look for the variable SQ_PGM_RESOURCES:NUM_GPRS at the bottom.

drstrip · ‎03-08-2010

Originally posted by: n0thing

You can see the number of registers used by your kernel in the ISA, look for the variable SQ_PGM_RESOURCES:NUM_GPRS at the bottom.

Sorry, what is the ISA? I thought it stood for Instruction Set Architecture.

Can I access this variable with clGetProgramBuildInfo? I found nothing in the build log.

jcpalmer · ‎03-08-2010

I agree more DeviceInfo queries for the future can help programs decide what might be better at run or preprocessor time, by sneaking a #define in. Private memory looks like an area to do that.

Uniform API based kernel info about the # of registers being used would also be good. Right now a good proxy is Max WorkGroup Size.

FYI, I am kind of shut out for the time being on this platform due to my use of images. But I also know how private memory location is controllable on Nvidia's implementation through an undocumented compile option that slipped out, -cl-nv-maxrregcount=nn.

It is useful on their platform to have a big work group size to hide the fixed latency when reading images. When you have a kernel that uses a large # of registers, forcing some out to global is a tradeoff that could pay. BUT, when trying to compile on OSX this generates an error, not a warning. A facility to specify which action to take would be good.

nou · ‎03-08-2010

@jcpalmer: this is something for a future version of OpenCL.

@drstrip: you can get the ISA and IL of your kernel by setting the environment variable GPU_DUMP_DEVICE_KERNEL=3 or by using Stream Kernel Analyzer.
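A minimal sketch of how this could be scripted: set the variable for a child process, then search the dumped ISA text for the SQ_PGM_RESOURCES:NUM_GPRS field mentioned earlier in the thread. The helper names are hypothetical, and the "name = value" layout assumed by the regular expression may not match every dump.

```python
import os
import re


def env_with_kernel_dump(base_env=None):
    """Return a copy of the environment with GPU_DUMP_DEVICE_KERNEL=3 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GPU_DUMP_DEVICE_KERNEL"] = "3"
    return env


# Assumed layout: the field name, optional spaces, '=', then a decimal number.
NUM_GPRS_RE = re.compile(r"SQ_PGM_RESOURCES:NUM_GPRS\s*=\s*(\d+)")


def find_num_gprs(isa_text):
    """Return the NUM_GPRS value found in dumped ISA text, or None if absent."""
    match = NUM_GPRS_RE.search(isa_text)
    return int(match.group(1)) if match else None


# Made-up fragment, only to exercise the parser; a real dump will differ.
sample = "SQ_PGM_RESOURCES:NUM_GPRS     = 12\n"
print(find_num_gprs(sample))
```

The environment returned by env_with_kernel_dump() can be passed as the env argument of subprocess.run when launching your own OpenCL host program.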

drstrip · ‎03-08-2010

Originally posted by: nou

@drstrip: you can get the ISA and IL of your kernel by setting the environment variable GPU_DUMP_DEVICE_KERNEL=3 or by using Stream Kernel Analyzer.

Thanks - that's just what I needed to know.

eduardoschardong · ‎03-08-2010

A question about indexed register access, for when it is used for private arrays (it's already in use for DX, right?):

How will the register file be accessed? I mean, if each thread uses a unique index there may be 64 different registers; how quickly can they be read from the register file? Will execution halt until the read finishes, as it does when there are bank conflicts in LDS?

_Big_Mac_ · ‎03-08-2010

For NVIDIA GPUs it's either 8192 or 16384 32-bit scalar registers per multiprocessor, depending on the card's compute capability (since GT200 it's 16384). Double words take up two registers.

So, that's up to 1.875 MB per device (for GTX 285 or the baddest Tesla). That's also more than their local memory space (currently 16KB per multiprocessor, 4x less).

NVIDIA GPUs have a smaller register file, but they use scalar registers and the pool is shared, so the two are not directly comparable. E.g., on NVIDIA cards you can trade off more registers per thread against more threads per multiprocessor; I'm not sure if that's how it works on ATI cards. Also, 32-bit variables don't take up 128 bits, so it's much more efficient for the scalar programming style they're trying to emulate.
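Both points can be sketched numerically. The multiprocessor count of 30 for a GTX 285 is an assumption added here, the function names are illustrative, and the thread bound ignores every limit other than register usage (allocation granularity, maximum resident threads, local memory).

```python
BYTES_PER_REGISTER = 4  # 32-bit scalar registers


def nvidia_register_file_bytes(registers_per_mp, multiprocessors):
    """Total register file size in bytes across all multiprocessors."""
    return registers_per_mp * BYTES_PER_REGISTER * multiprocessors


def max_threads_per_mp(registers_per_mp, registers_per_thread):
    """Upper bound on resident threads imposed by register usage alone."""
    return registers_per_mp // registers_per_thread


# Assumption: 30 multiprocessors for a GTX 285.
total = nvidia_register_file_bytes(16384, 30)
print(total / 2**20)  # 1.875, in units of 2**20 bytes

# The shared pool means fewer registers per thread allow more threads.
print(max_threads_per_mp(16384, 16))  # 1024
print(max_threads_per_mp(16384, 32))  # 512
```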

As for arrays, this is tricky.

On NVIDIA cards, arrays defined "just like that" (stack) will map to registers only if they are small and indexing is performed entirely by literals (i.e. a[1], never a[i]). Otherwise they end up in private memory (effectively global memory). It's difficult to do dynamic indexing on registers: if the compiler must assume the "i" in a[i] is dynamic, it gives up and puts "a" into a kind of memory where it can use pointer arithmetic.

Alternatively, you can use local memory for storing an array, at which point you can do pointer arithmetic. But naturally the semantics change: the array is now shared by all work-items in the group, so it's not private storage anymore.

davibu · ‎03-08-2010

Originally posted by: nou Well, that is quite dependent on the implementation and even on the kernel code. The AMD implementation uses registers for single values of built-in types, but if you use a private array, for example int a[10];, then it is stored in global memory.

Aren't they using local memory for arrays? I remember reading something along those lines in a post on this forum.

I ran some tests in the past and got the same performance from a private array as from local memory.

nou · ‎03-08-2010

Well, on 5xxx, reads from global memory are cached, so small arrays fit in this cache and IMHO there is only a small performance penalty. But you can see VFETCH and MEM STORE instructions in the ISA code if you use private arrays.

Fr4nz · ‎03-08-2010

Originally posted by: nou Well, on 5xxx, reads from global memory are cached, so small arrays fit in this cache and IMHO there is only a small performance penalty. But you can see VFETCH and MEM STORE instructions in the ISA code if you use private arrays.

What about using private int4/uint4/float4 variables (not arrays)? Are they stored in registers like scalar variables?

nou · ‎03-08-2010

Yes, int4 and other vector types are in registers.

MicahVillmow · ‎03-08-2010

Franz,
The same register can be accessed on every instruction block, so there are no real latency issues. The main issue is the port restrictions on the register file, which are explained in the ISA doc.

private vs. local memory