24 Replies Latest reply on Mar 8, 2010 8:15 PM by _Big_Mac_

    private vs. local memory

    drstrip

      I know that the OpenCL API allows me to determine whether a device provides (true) local memory, but I can't find a comparable query about private memory. Is it assumed that a device with true local memory always provides true private memory as well? Conversely, is it also true that a device without local memory never has true private memory?

       

        • private vs. local memory
          Fr4nz

           

          Originally posted by: drstrip I know that the OpenCL API allows me to determine whether a device provides (true) local memory, but I can't find a comparable query about private memory.


          In fact, OpenCL gives you no way to determine whether a device has real private memory or emulates it...

           

          Is it assumed that a device with true local memory always provides true private memory as well? Conversely, is it also true that a device without local memory never has true private memory?


          Private and local memory are two different things: the first represents the stack of a kernel; the second is a very fast (and small) memory to use when you work on small, heavily reused data, or when you want to relieve pressure on global memory because you read or write the same data many times.

          At the moment, the current AMD OpenCL implementation maps private memory onto global memory. On the 5xxx series, global memory is cached through the texture caches, so there are benefits to using private variables anyway (obviously these variables mustn't be too big and should preferably be reused by work-items).

          If I'm not wrong, AMD plans to map private memory onto SIMD engine registers in future OpenCL releases (IIRC we have 2kB per SIMD engine), but only AMD staff can answer your question precisely.

            • private vs. local memory
              nou

              What I know is that simple variables like float and int are mapped to registers, but private arrays are mapped to global memory. AMD developers are working on putting arrays in registers too.

                • private vs. local memory
                  Fr4nz

                   

                  Originally posted by: nou What I know is that simple variables like float and int are mapped to registers, but private arrays are mapped to global memory. AMD developers are working on putting arrays in registers too.

                   

                   

                  Oh, simple variables are already mapped onto registers? Really?? Where did you learn that? If true, it's very nice!

                  Anyway, nice to hear that about arrays!

                  And what about the texture cache? Will it still be used in the future as a cache for global memory (VTEX), or will another type of cache be used?

                    • private vs. local memory
                      drstrip

                      To me, this whole discussion points out a shortcoming of the current OpenCL spec. Efficient algorithms require knowledge of not just whether local and private memory are implemented differently from global, but also the relative access times. The Sobel filter example copies a submatrix to local memory for use in the work group. Since the local memory is just mapped back to global, these copies are completely wasted, resulting in slower execution. Even if there were true local memory, unless its access time is sufficiently faster than that of global memory, these copies remain wasted.
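The break-even argument above can be put into a rough cost model. The sketch below is only an illustration under stated assumptions: the function name `local_copy_pays_off` is hypothetical, each access is charged a single fixed cost (`t_global` or `t_local`, in arbitrary units), and caching, coalescing and latency hiding are all ignored. It is not something the OpenCL API reports.

```c
/* Rough cost model (an assumption, not an OpenCL facility): staging data in
   local memory costs one global read plus one local write per element, after
   which every read is served from local memory. Reading directly costs one
   global read per access. Returns 1 when staging is cheaper, 0 otherwise. */
int local_copy_pays_off(unsigned long elems, unsigned long reads,
                        double t_global, double t_local)
{
    double direct = (double)reads * t_global;
    double staged = (double)elems * (t_global + t_local)
                  + (double)reads * t_local;
    return staged < direct;
}
```

With a 16 x 16 tile read nine times per element (as a 3 x 3 filter would, ignoring the halo), `local_copy_pays_off(256, 2304, 100.0, 100.0)` is false because emulated local memory costs the same as global memory, while a local access ten times cheaper makes the copy worthwhile.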

                        • private vs. local memory
                          Fr4nz

                           

                          Originally posted by: drstrip To me, this whole discussion points out a shortcoming of the current OpenCL spec. Efficient algorithms require knowledge of not just whether local and private memory are implemented differently from global, but also the relative access times. The Sobel filter example copies a submatrix to local memory for use in the work group. Since the local memory is just mapped back to global, these copies are completely wasted, resulting in slower execution. Even if there were true local memory, unless its access time is sufficiently faster than that of global memory, these copies remain wasted.

                           

                          Well, on 5xxx video cards local memory is mapped onto real (hardware) local memory. Only 4xxx cards emulate it in global memory.

                  • private vs. local memory
                    Raistmer
                    Hm....
                    Cache type: None
                    Cache line size: 0
                    Cache size: 0
                    Global memory size: 134217728
                    Constant buffer size: 65536
                    Max number of constant args: 8
                    Local memory type: Global
                    Local memory size: 16384

                    The point of interest is "Local memory type: Global". It seems there is a way to find out whether local memory is emulated via global memory or not.
                    (Or maybe I misunderstood the word "Global" here?)
                      • private vs. local memory
                        drstrip

                        My question asks whether private memory is emulated. Look at the top post - I start by pointing out that we can determine this for local memory.

                         

                        And the fact that we have local memory does not address the question of relative access time, at least as far as being able to determine this via the OpenCL API.

                          • private vs. local memory
                            nou

                            Well, that depends heavily on the implementation and even on the kernel code. The AMD implementation uses a register for a single value of a built-in type, but if you use a private array, for example int a[10];, it is stored in global memory.

                            On NVIDIA cards registers are used as private memory, but if you use too much private memory it starts spilling to global memory too.

                              • private vs. local memory
                                Fr4nz

                                 

                                Originally posted by: nou Well, that depends heavily on the implementation and even on the kernel code. The AMD implementation uses a register for a single value of a built-in type, but if you use a private array, for example int a[10];, it is stored in global memory.

                                On NVIDIA cards registers are used as private memory, but if you use too much private memory it starts spilling to global memory too.

                                 

                                By any chance, do you know how large the register files are on both ATI and NVIDIA? Maybe 2kB per SIMD engine (so 2kB per thread stack)?

                                  • private vs. local memory
                                    nou

                                    5870: 256 x 64 x 128 bit x 20 SIMDs = 5.24 MB

                                    4870: 256 x 64 x 128 bit x 10 SIMDs = 2.62 MB

                                    NVIDIA has smaller register files.

                                      • private vs. local memory
                                        Fr4nz

                                         

                                        Originally posted by: nou 5870: 256 x 64 x 128 bit x 20 SIMDs = 5.24 MB

                                        4870: 256 x 64 x 128 bit x 10 SIMDs = 2.62 MB

                                        NVIDIA has smaller register files.

                                         

                                        Holy god, this is a LOT of private memory! Much more than local memory!

                                        Moreover, registers are at least as fast as local memory, right? Let's hope ATI soon keeps small private arrays in registers too...

                                          • private vs. local memory
                                            drstrip

                                            Do these numbers break down like this?

                                            256 - wave front size

                                            64 - registers per thread

                                            128 bit - register size.

                                             

                                            If I have an int, does that take an entire register, or just 4 bytes worth?

                                              • private vs. local memory
                                                n0thing

                                                No it is like this -

                                                256 - Registers per thread

                                                64 - Wavefront-size

                                                Each register is 128 bits wide, so an int takes up an entire register. It's therefore better to vectorize your algorithm so that it maps to the underlying hardware.

                                                You can see the number of registers used by your kernel in the ISA, look for the variable SQ_PGM_RESOURCES:NUM_GPRS at the bottom.
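The arithmetic behind these figures can be checked with a short sketch. The function names below are hypothetical; the inputs (256 registers per thread, a wavefront of 64, 128-bit registers, 20 or 10 SIMD engines) are the values quoted in this thread, not values queried from a device.

```c
#include <stdint.h>

/* Total register file size in bytes for a vector register file, computed as
   registers per thread x wavefront size x register width x SIMD engines. */
uint64_t register_file_bytes(uint32_t regs_per_thread,
                             uint32_t wavefront_size,
                             uint32_t register_bits,
                             uint32_t simd_engines)
{
    return (uint64_t)regs_per_thread * wavefront_size
         * (register_bits / 8u) * simd_engines;
}

/* Fraction of a register that carries data when it holds a value of
   used_bits, e.g. a 32-bit int stored in a 128-bit register. */
double register_utilization(uint32_t used_bits, uint32_t register_bits)
{
    return (double)used_bits / (double)register_bits;
}
```

For instance, `register_file_bytes(256, 64, 128, 20)` yields 5,242,880 bytes, the 5.24 MB quoted for the 5870, and `register_utilization(32, 128)` is 0.25, which is why a scalar int leaves three quarters of a register unused.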

                                                  • private vs. local memory
                                                    drstrip

                                                     

                                                    Originally posted by: n0thing

                                                     

                                                    You can see the number of registers used by your kernel in the ISA, look for the variable SQ_PGM_RESOURCES:NUM_GPRS at the bottom.

                                                     

                                                    Sorry, what is the ISA? I thought it stood for Instruction Set Architecture.

                                                    Can I access this variable with clGetProgramBuildInfo? I found nothing in the build log.

                                                      • private vs. local memory
                                                        jcpalmer

                                                        I agree more DeviceInfo queries for the future can help programs decide what might be better at run or preprocessor time, by sneaking a #define in.  Private memory looks like an area to do that.  

                                                        Uniform API based kernel info about the # of registers being used would also be good.  Right now a good proxy is Max WorkGroup Size.

                                                        FYI, I am kind of shut out for the time being on this platform due to my use of images.  But I also know how private memory location is controllable on Nvidia's implementation through an undocumented compile option that slipped out, -cl-nv-maxrregcount=nn.

                                                        It is useful on their platform to have a big work group size to hide the fixed latency when reading images.  When you have a kernel that uses a large # of registers, forcing some out to global is a tradeoff that could pay. BUT, when trying to compile on OSX this generates an error, not a warning.  A facility to specify which action to take would be good.

                                                          • private vs. local memory
                                                            nou

                                                            @jcpalmer: this is something for a future version of OpenCL.

                                                            @drstrip: you can get the ISA and IL of your kernel by setting the environment variable GPU_DUMP_DEVICE_KERNEL=3 or by using Stream Kernel Analyzer.

                                                              • private vs. local memory
                                                                drstrip

                                                                 

                                                                Originally posted by: nou

                                                                 

                                                                @drstrip: you can get the ISA and IL of your kernel by setting the environment variable GPU_DUMP_DEVICE_KERNEL=3 or by using Stream Kernel Analyzer.

                                                                 

                                                                 

                                                                Thanks - that's just what I needed to know.

                                                                  • private vs. local memory
                                                                    eduardoschardong

                                                                    A question about indexed register access, for when it is used for private arrays (it's already in use for DX, right?):

                                                                    How will the register file be accessed? I mean, if each thread uses a unique index there may be 64 different registers; how quickly will they be read from the register file? Will execution stall until the reads finish, as it does with bank conflicts in the LDS?

                                                                     

                                                                • private vs. local memory
                                                                  _Big_Mac_

                                                                  For NVIDIA GPUs it's either 8192 or 16384 32-bit scalar registers per multiprocessor, depending on the card's compute capability (since GT200 it's 16384). Double words take up two registers.

                                                                  So, that's up to 1.875 MB per device (for GTX 285 or the baddest Tesla). That's also more than their local memory space (currently 16KB per multiprocessor, 4x less).

                                                                  NVIDIA GPUs have a smaller register file, but they use scalar registers and the pool is shared, so the two are not directly comparable. For example, on NVIDIA cards you can trade more registers per thread against more threads per multiprocessor; I'm not sure if that's how it works on ATI cards. Also, 32-bit variables don't take up 128 bits, so it's much more efficient for the scalar programming style they're trying to emulate.
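The trade-off between registers per thread and threads per multiprocessor can be sketched as simple division. The function names are hypothetical, and the count of 30 multiprocessors used in the example is an assumption inferred from the 1.875 MB figure above rather than stated in the thread. Real hardware applies further limits (maximum resident threads, allocation granularity) that this sketch ignores.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on resident threads per multiprocessor when the shared pool
   of scalar registers is the only limit considered. */
uint32_t max_threads_by_registers(uint32_t regs_per_multiprocessor,
                                  uint32_t regs_per_thread)
{
    assert(regs_per_thread > 0);  /* avoid division by zero */
    return regs_per_multiprocessor / regs_per_thread;
}

/* Device-wide register file size in bytes, assuming 32-bit (4-byte)
   scalar registers. */
uint64_t scalar_register_file_bytes(uint32_t regs_per_multiprocessor,
                                    uint32_t multiprocessors)
{
    return (uint64_t)regs_per_multiprocessor * 4u * multiprocessors;
}
```

With 16384 registers per multiprocessor, a kernel using 16 registers per thread could keep at most 1024 threads resident by this measure, and one using 32 registers at most 512. `scalar_register_file_bytes(16384, 30)` gives 1,966,080 bytes, i.e. 1.875 MB counted in units of 1024 x 1024 bytes.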

                                                                  As for arrays, this is tricky.

                                                                  On NVIDIA cards, arrays defined "just like that" (stack) will map to registers only if they are small and indexing is performed entirely by literals (i.e. a[1], never a[i]). Otherwise they end up in private memory (effectively global memory). It's difficult to do dynamic indexing on registers: if the compiler must assume the "i" in a[i] is dynamic, it gives up and puts "a" into a kind of memory where it can use pointer arithmetic.

                                                                  Alternatively you can use local memory to store an array, at which point you can do pointer arithmetic. But naturally the semantics change: the array is now shared by all work-items in the group, so it's not private storage anymore.
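The two indexing patterns described above look like this in source form. This is plain C standing in for kernel code, with hypothetical function names; whether a given OpenCL compiler actually keeps the first array in registers is implementation-dependent, and the sketch only shows the source-level difference.

```c
/* Literal indices only: every access names a fixed element, so a compiler
   can treat a[0]..a[3] as four independent scalars. */
int sum_literal_indices(int x0, int x1, int x2, int x3)
{
    int a[4];
    a[0] = x0;
    a[1] = x1;
    a[2] = x2;
    a[3] = x3;
    return a[0] + a[1] + a[2] + a[3];
}

/* Run-time index: the element is not known at compile time, so the array
   has to live in storage that can be addressed with pointer arithmetic. */
int pick_dynamic_index(int x0, int x1, int x2, int x3, unsigned i)
{
    int a[4] = { x0, x1, x2, x3 };
    return a[i & 3u];  /* mask keeps the index within the four elements */
}
```

A loop whose trip count is known at compile time can end up in the first category once it is fully unrolled, since each a[i] then becomes a literal access.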

                                                                   

                                                    • private vs. local memory
                                                      davibu

                                                       

                                                      Originally posted by: nou Well, that depends heavily on the implementation and even on the kernel code. The AMD implementation uses a register for a single value of a built-in type, but if you use a private array, for example int a[10];, it is stored in global memory.

                                                      Aren't they using local memory for arrays? I remember reading something along these lines in a post on this forum.

                                                      I ran some tests in the past and got the same performance with a private array as with local memory.

                                                       

                                                        • private vs. local memory
                                                          nou

                                                          Well, on 5xxx, reads from global memory are cached, so small arrays fit in this cache and IMHO there is only a small performance penalty. But you can see VFETCH and MEM STORE instructions in the ISA code if you use private arrays.

                                                            • private vs. local memory
                                                              Fr4nz

                                                               

                                                              Originally posted by: nou Well, on 5xxx, reads from global memory are cached, so small arrays fit in this cache and IMHO there is only a small performance penalty. But you can see VFETCH and MEM STORE instructions in the ISA code if you use private arrays.

                                                               

                                                              What about using private int4/uint4/float4 variables (not arrays)? Are they stored in registers like scalar variables?

                                                    • private vs. local memory
                                                      MicahVillmow
                                                      Franz,
                                                      The same register can be accessed on every instruction block, so there are no real latency issues. The main issue is the port restrictions on the register file, which are explained in the ISA doc.