14 Replies Latest reply on May 11, 2010 7:10 PM by davibu

    How do the GPU cores access a common memory location like a Lookup Table?

    rotor

      Hi guys,

      I am using a lookup table (LUT) in my kernel and have run into a question: how do the GPU cores access a common memory location like a LUT?

      I am considering this scenario: I have a lookup table, LUT, located somewhere in global memory, and during execution my threads repeatedly read from it to look up results (let's say through 2D indexing, LUT[i][j]). Now the question is: what happens if 100 cores (computational units) access this LUT at the same moment? Do the cores have to wait in a queue or something like that, since they all refer to the same memory object? If they don't have to wait in line to access the LUT, how does the GPU memory controller handle this? In particular, when 100 cores refer to 100 different elements of the LUT, how does the GPU's address decoder unit work? Is there any special architecture inside the GPU that allows the cores to share the same memory bus?

      Thanks,

      Roto

        • How do the GPU cores access a common memory location like a Lookup Table?
          LeeHowes

          If you mean multiple SIMDs accessing the same addresses, then either you cache or you don't. If you don't cache, you'll queue at the memory controller, and while it is possible that accesses will be combined at the interface, it is unlikely because of the amount of traffic and the timing of requests. If you use texture caches, which you really should for this sort of data (given that at the moment plain buffers do not cache, for reasons I can't really go into), then you have read-only and hence non-coherent caching on a per-SIMD basis. A request that a given SIMD makes to its own cache is entirely independent of accesses from other SIMDs, so once the data is in there, accesses no longer need to hit the main memory controller.

          In a given wave running on a SIMD, if multiple elements access different addresses it will make n requests to the memory controller where n is the number of cache lines that are required to service the set of addresses. Multiple addresses that can be served by a given cache line will be serviced through the memory crossbar as the data comes out of the cache. Those addresses will queue relative to requests from other SIMDs.
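
          As a rough illustration of the counting described above (a sketch, not part of the original reply): the number of requests a wave generates is the number of distinct cache lines its addresses fall into. The wave width of 64 and the cache line size passed in are assumptions made for this example, and count_cache_lines is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed wavefront width for this sketch. */
#define WAVE_SIZE 64

/* Count the distinct cache lines touched by n byte addresses (n <= WAVE_SIZE).
   line_bytes is an assumed cache line size and must be non-zero. */
size_t count_cache_lines(const uint64_t *addrs, size_t n, uint64_t line_bytes)
{
    uint64_t seen[WAVE_SIZE];   /* cache line indices seen so far */
    size_t unique = 0;

    assert(n <= WAVE_SIZE);
    assert(line_bytes != 0);

    for (size_t i = 0; i < n; ++i) {
        uint64_t line = addrs[i] / line_bytes;
        size_t j = 0;
        /* Linear search is fine for at most WAVE_SIZE entries. */
        while (j < unique && seen[j] != line)
            ++j;
        if (j == unique)
            seen[unique++] = line;
    }
    return unique;
}
```

          With an assumed 64-byte line, four work-items reading byte addresses 0, 4, 8 and 12 need one line, while addresses 0, 64, 128 and 192 need four, so a scattered LUT access pattern costs more requests than a clustered one.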

          Does that answer your question?

            • How do the GPU cores access a common memory location like a Lookup Table?
              davibu

               

              Originally posted by: LeeHowes If you use texture caches, which you really should for this sort of data (given that at the moment plain buffers do not cache, for reasons I can't really go into)

              So, can you confirm that access to normal memory buffers is completely uncached at the moment?

              This represents a great opportunity to boost the performance of existing applications by switching to image buffers.

               

                • How do the GPU cores access a common memory location like a Lookup Table?
                  Fr4nz

                   

                  Originally posted by: davibu
                  Originally posted by: LeeHowes If you use texture caches, which you really should for this sort of data (given that at the moment plain buffers do not cache, for reasons I can't really go into)

                  So, can you confirm that access to normal memory buffers is completely uncached at the moment?

                  This represents a great opportunity to boost the performance of existing applications by switching to image buffers.



                  If I remember correctly, "regular" buffers should be cached through the texture cache on the 5xxx series.

                    • How do the GPU cores access a common memory location like a Lookup Table?
                      LeeHowes

                      As far as the hardware is concerned, a normal buffer is a texture with no filtering and can be run through the texture fetch units with all the caching you'd expect, so in that sense it should work. There are software reasons why, at this time, it does not in OpenCL (it should in DirectCompute). For the time being, if you want caching, use images. It's not a bad idea in general anyway, because images remove a level of pointer analysis from the compiler: the more information you can give a compiler, the better its chances of optimising.

                        • How do the GPU cores access a common memory location like a Lookup Table?
                          dominik_g

                          If I remember correctly, on NVIDIA GPUs constant memory is cached on-chip.

                          Is there something similar on AMD/ATI GPUs? Or is there no difference between constant and global memory?

                            • How do the GPU cores access a common memory location like a Lookup Table?
                              LeeHowes

                              I'll point you to Micah's comment on that. It's a work in progress:

                              http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=132843&enterthread=y

                              ETA: need to fix that URL, used to other forum software.

                                • How do the GPU cores access a common memory location like a Lookup Table?
                                  rotor

                                  Thank you very much, LeeHowes. Your answer makes this issue much clearer to me.

                                  So basically we should cache our LUT in SIMD shared memory to reduce memory traffic and conflicts. However, that would be tough if the LUT is too large to fit in the SIMD cache. Additionally, if we cache the LUT in shared memory, the size of the LUT may affect the size of the workgroup (i.e. how many threads are in a workgroup). A LUT really saves us from unnecessary computation, so I hope the GPU manufacturers will come up with a new architecture to solve this conflict problem.

                                  For all: I would like to bring here a link that Byron posted in the Developer Tools forum: http://developer.amd.com/documentation/presentations/Pages/default.aspx. Please see the DirectCompute Performance presentation by Nick Thibieroz. It is a very useful set of slides that addresses a lot of performance considerations.

                                  Again, thank you very much collaborative minds.

                                  Roto

                                    • How do the GPU cores access a common memory location like a Lookup Table?
                                      LeeHowes

                                      Yes, that's a pretty good presentation, though I'm slightly wary of the description of nvidia GPUs as scalar and AMD GPUs as vector - it's a bit of a poor way of thinking about the world to picture nvidia GPUs as scalar, and the difference between the two in programming terms needn't be very large. Unrolling a loop gets you as much ALU bonus as vectorisation on AMD hardware, and unrolling is just as beneficial for nvidia.

                                      I would say whether you use LDS or texture cache for your LUT depends on the algorithm. How much of the LUT will be used? If you're using, say, 10% of the LUT in a given kernel, loading the whole thing into LDS might be inefficient. If you are using 90% of it and using it regularly, then LDS is probably a better option because it offers higher throughput than cache. As with any architecture, you have to optimise for the use case.

                                      I'm not sure what new features you're after, though. The only obvious thing to me is read/write caching, which is beneficial in some cases where it ends up being used as a near-infinitely large LDS.

                                      We have some other software and hardware improvements coming in the future that will help with this kind of problem. I doubt LDS will disappear any time soon because transistor for transistor it's vastly more efficient than hardware-controlled cache.

                                        • How do the GPU cores access a common memory location like a Lookup Table?
                                          davibu

                                          I'm trying to "encode" some read-only data inside an image in order to take advantage of cache as discussed in this topic.

                                          I encode in 7 CL_RGBA/CL_FLOAT pixels the following structure:

                                          typedef struct {
                                              float4 bboxes[2][3];
                                              int4 children;
                                          } QBVHNode;

                                          And I read back the data with something like:

                                          const float4 bboxes_minX = read_imagef(nodes, imageSampler, (int2)(inx + bboxes_minXIndex, iny));
                                          const float4 bboxes_maxX = read_imagef(nodes, imageSampler, (int2)(inx + bboxes_maxXIndex, iny));
                                          const float4 bboxes_minY = read_imagef(nodes, imageSampler, (int2)(inx + bboxes_minYIndex, iny));
                                          const float4 bboxes_maxY = read_imagef(nodes, imageSampler, (int2)(inx + bboxes_maxYIndex, iny));
                                          const float4 bboxes_minZ = read_imagef(nodes, imageSampler, (int2)(inx + bboxes_minZIndex, iny));
                                          const float4 bboxes_maxZ = read_imagef(nodes, imageSampler, (int2)(inx + bboxes_maxZIndex, iny));
                                          const int4 children = as_int4(read_imagef(nodes, imageSampler, (int2)(inx + 6, iny)));

                                           

                                          Everything works fine for the 6 x float4 fields; however, I'm able to read the int4 field if and only if the integer's bit pattern corresponds to a valid floating-point value.

                                          It looks like read_imagef() is able to read only valid floating-point numbers (it doesn't just move the bits around).

                                          Is the only solution to encode the int4 fields in a separate image? This would be quite annoying.
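
                                          For reference, the pixel addressing implied by the reads above can be sketched on the host side as follows. This is only an illustration under stated assumptions: each node occupies 7 consecutive pixels within a single image row (nodes never straddle two rows), and node_base_pixel and PixelCoord are hypothetical names that do not appear in the original code.

```c
#include <assert.h>
#include <stddef.h>

/* One node = 6 float4 bounding-box pixels + 1 int4 children pixel. */
#define PIXELS_PER_NODE 7

typedef struct {
    size_t x;   /* column of the node's first pixel (inx in the kernel) */
    size_t y;   /* row of the node (iny in the kernel) */
} PixelCoord;

/* First pixel of node `node` in an image `width` pixels wide.
   Assumes nodes are packed left to right and never straddle two rows,
   so any leftover pixels at the end of a row stay unused. */
PixelCoord node_base_pixel(size_t node, size_t width)
{
    size_t nodes_per_row = width / PIXELS_PER_NODE;
    PixelCoord c;

    assert(nodes_per_row > 0);   /* the image must fit at least one node per row */

    c.x = (node % nodes_per_row) * PIXELS_PER_NODE;
    c.y = node / nodes_per_row;
    return c;
}
```

                                          Under these assumptions, field k of a node (k from 0 to 6) sits at pixel (x + k, y), which matches the (inx + 6, iny) coordinate used for the children field.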

                                           

                                           

                          • How do the GPU cores access a common memory location like a Lookup Table?
                            MicahVillmow
                            If you want to read integers, please use read_imagei or read_imageui, not read_imagef.
                            • How do the GPU cores access a common memory location like a Lookup Table?
                              MicahVillmow
                              Lee,
                              The reason is that our GPUs don't support denorms and thus would flush to zero many values that are represented as normal ints when using the floating-point path. The correct way to do this is to read as unsigned integers and bitcast to floats, as that guarantees that no formatting is done on the integer types.
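
                              To make the effect concrete, here is a small host-side sketch (not part of the original reply). It models a float path that flushes denormals to zero; the flush model is a simplified assumption about the behaviour described above, and the helper names are hypothetical. Any positive integer below 0x00800000 (8,388,608) has an all-zero exponent field when its bits are viewed as an IEEE 754 single, so it is a denormal bit pattern.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bitcast helpers: memcpy is the portable way to reinterpret bits in C. */
uint32_t float_to_bits(float f)
{
    uint32_t u;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&u, &f, sizeof u);
    return u;
}

float bits_to_float(uint32_t u)
{
    float f;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&f, &u, sizeof f);
    return f;
}

/* True when the bit pattern encodes a denormal float:
   exponent field all zeros, mantissa non-zero. */
int is_denormal_bits(uint32_t bits)
{
    return (bits & 0x7F800000u) == 0 && (bits & 0x007FFFFFu) != 0;
}

/* Simplified model of a float path that flushes denormals to
   (signed) zero and leaves every other bit pattern untouched. */
uint32_t flush_denormal_bits(uint32_t bits)
{
    return is_denormal_bits(bits) ? (bits & 0x80000000u) : bits;
}
```

                              In this model a child index such as 5 comes back as 0 after passing through the float path, while the bit pattern of 1.0f (0x3F800000) survives unchanged. Reading through an integer path and bitcasting afterwards, for example as_float4(read_imageui(...)) in OpenCL C, avoids the flush because the bits are never interpreted as floats on the way in.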
                                • How do the GPU cores access a common memory location like a Lookup Table?
                                  davibu

                                  Thanks for the information.

                                  As I understand the OpenCL spec, it is not correct to use read_imagei on a cl::ImageFormat(CL_RGBA, CL_FLOAT). Am I wrong? If I'm right, read_imagei is not an option.

                                  For the record, I split my data among multiple images (float4 fields stored in a CL_RGBA/CL_FLOAT image and int4 fields stored in a CL_RGBA/CL_UNSIGNED_INT32 image). At the moment, storing data inside an image is not giving me any performance improvement, but I guess spreading the data among multiple images doesn't help.