21 Replies Latest reply on Oct 10, 2008 11:31 PM by udeepta@amd

    Maximum 1D array

    bubu

      Is the maximum posible size of an 1D stream just 2^23? Well, I need to allocate a 1D-linear array with almost 384Mb... any idea how to perform this, pls?

       

      thx

        • Maximum 1D array
          bubu

          so... it's not possible then?

            • Maximum 1D array
              ryta1203
              As far as I know it's not possible in Brook+. Have you tried this in CAL?

              In Brook+ the max is 8192x8192 elements.
                • Maximum 1D array
                  bubu

                  Well I prefer to use brook+ and not CAL because I don't like the idea to work with low-level shaders. Do you know if brook`is going to extend the 1D arrays to, let's say, the full graphics card's VRAM? In CUDA I can allocate a really big 1D array without problems...

                    • Maximum 1D array
                      bubu

                      I see in the 1.2 docs that the maximum 1D array has 8192 elements ( 64M with virtualisation )... but I need to allocate 300 or 400Mb of linear data...

                      With CUDA I can use all the VRAM if it's needed ( using device memory pointers )... is there any way to allocate, for example, 500Mb of VRAM in brook+, pls? If not... when is this going to be supported?

                        • Maximum 1D array
                          udeepta@amd

                          You can allocate a 1D array in C/C++, pass the pointer to a 2D stream in Brook+. Inside the kernel, you can reconstruct the 1D index (index = row*width+column). The maximum 2D array allowed in Brook+ is 8192x8192. We are considering increasing this limit in future releases.

                            • Maximum 1D array
                              bubu

                               

                              Originally posted by: udeepta@amdWe are considering increasing this limit in future releases.

                              Yes, please... 64Mb is not enough. I just hope that won't be a hardware limit

                               

                              For a texture 8kx8k can be enough... but, in my case, I need to store a really big tree in VRAM.

                                • Maximum 1D array
                                  udeepta@amd

                                  We hear ya.

                                  In the mean time, will your algorithm work if you break the large linear data into a few smaller segments? The performance impact will not be very high. But i agree the limitation is something we could do without.

                                    • Maximum 1D array
                                      bubu

                                       

                                      Originally posted by: udeepta@amd

                                      In the mean time, will your algorithm work if you break the large linear data into a few smaller segments?

                                       

                                      It's for raytracing a 15M poly model using a kd-tree in the GPU ( I need a total of 700Mb for that... which should be ok using a 1Gb card ). I could be  break the kernels in parts... but it gonna be too difficult, so I think I'll just wait until you remove the 64Mb limitation. I need to polish some things meanwhile.

                                       

                                      Btw... other question... can be the StreamSDK's arrays virtualized like it's done with the system RAM? For example... imagine the graphics card the user is using has only 512Mb... Can be the other 256Mb I need got from the AGP/PCI memory?

                                      thx.

                                      • Maximum 1D array
                                        eduardoschardong
                                        Is the 8192x8192 a hardware limitation?
                                        If so, there is any possibility of future compilers avoiding it? Like, instead of 8192x8192 floats using 8192x8192 float4 but redirecting the last 2 bits to the corrected position? And/Or using multiple input/ouput streams and using the first bits to choose wich?
                          • Maximum 1D array
                            MicahVillmow
                            The 8192x8192 limitation is hardware and should be different with future hardware.
                              • Maximum 1D array
                                bubu

                                 

                                Originally posted by: MicahVillmow The 8192x8192 limitation is hardware and should be different with future hardware.


                                Could be possible to emulate it fast and automatically in the driver? I really need to see all the graphics card's VRAM as a linear 1D pointer for the current HW generation... seriously.

                                Once this is done, I would like to see a Firestream with 4Gb, like the NVIDIA Tesla C1060. The mesh data I need to manage can occupy more than 2Gb... that or use a virtual VRAM memory system through the AGP/PCI like the CPU does.... I think Vista can virtualise the VRAM.

                              • Maximum 1D array
                                MicahVillmow
                                bubu,
                                It is possible to get access to all the vram inside a kernel, but this requires using CAL, IL, and the global buffer(this gives you access to 32 bits of address space). There are problems with this, but these are operating system limitations mainly dealing with the mapping of graphics vram into the system pci memory space. There is usually a limit of around 200mb set by the driver/os that can be mapped directly at one point.

                                Memory management of the devices vram is left to the application writer as a single policy enforced by the driver is not ideal for all workloads.
                                  • Maximum 1D array
                                    bubu

                                     

                                    Originally posted by: MicahVillmow bubu, It is possible to get access to all the vram inside a kernel, but this requires using CAL, IL, and the global buffer(this gives you access to 32 bits of address space).


                                     

                                    But... can I use CAL to allocate the buffer and, then, pass the pointer to Brook+? I don't like the idea to program kernels in pseudo-assembly language. For example, this is what I want:

                                     

                                    byte* dataIn = (byte*)calAllocateBuffer(2*1024*1024);

                                    byte* dataOut = (byte*)calAllocateBuffer(1024);

                                     

                                    void myBrookPlusKernel ( byte *ptr )

                                    {

                                        const byte dIn = ptr[1800000000];

                                        dataOut[threadId] = dIn+10;

                                    }

                                      • Maximum 1D array
                                        udeepta@amd

                                        It is possible to write a kernel in Brook+, convert it to IL using the Brook+ compiler (brcc), and use the generated IL in CAL. Or better yet, put in the Brook+ kernel in ShaderAnalyzer and get the IL from there.

                                        It is not as straightforward as what you asked, but it is a good middle ground where you get all CAL functionality and the ease of Brook+ kernel programming. 

                                          • Maximum 1D array
                                            bubu

                                             

                                            Originally posted by: udeepta@amd It is possible to write a kernel in Brook+, convert it to IL using the Brook+ compiler (brcc), and use the generated IL in CAL. Or better yet, put in the Brook+ kernel in ShaderAnalyzer and get the IL from there.


                                             

                                            Ok... so imagine I need to access 256Mb of VRAM in a 512Mb Radeon.

                                            For example, let's fill that data with a value of 0.5f ( I know, a dumb kernel ).

                                            I should do this in Brook+ ( just a pseudo code ):

                                            [code]

                                            void kernel ( float buff[] )

                                            {

                                               buff[(blockIdx.x*blockDim.x)+threadId.x] = 0.5f;

                                            }

                                            [/code]

                                            Then I compile it using Shaderanalyzer or whatever... so the AMD IL assembler is generated.

                                             

                                            Then, In my C++ program I do:

                                             

                                            [code]

                                            //initialize CAL

                                            ...

                                            //Allocate 256Mb using CAL

                                            void *ptr = calMalloc(256*1024*1024);

                                             

                                            //Load the kernel compiled and pass the VRAM pointer as IL input.

                                            handle = calLoadKernel("c:\test\myKernel.il");

                                            calSetInput(handle,"buff",ptr);

                                            [/code]

                                             

                                             

                                              • Maximum 1D array
                                                bubu

                                                Is that the way, pls?

                                                  • Maximum 1D array
                                                    jean-claude

                                                    Hi Bubu,

                                                    Your point is a smart one.

                                                    Writing kernel in Brook in order to get an assembly program and then using CAL for overall memory management seems to me the best compromise.

                                                    What is needed from AMD is a short application note (les than one page) to provide guidance on how to proceed.

                                                    Could a support guy from AMD take this as an action?

                                                    I'm sure this would be of great (and easy) help to overcome some of the current Brook limits.

                                                    Thanks

                                                    Jean-Claude

                                                      • Maximum 1D array
                                                        udeepta@amd

                                                        To use IL with CAL, you will need to do some set up and mapping, as Micah explained. The CAL samples in the SDK show examples of that.

                                                        And you can go from Brook+ to IL easily. The issue is that brcc output IL is not compact, so the mapping in CAL can get murky, and I would not go that route now.

                                                        But we are working on cleaing up the brcc output IL for the next release -- am hoping that will help.

                                            • Maximum 1D array
                                              MicahVillmow
                                              bubu,
                                              This can be done using the C++ interface, but i'm not sure about the C interface. As I am more of a CAL engineer and not Brook+.
                                              • Maximum 1D array
                                                MicahVillmow
                                                Jean-Claude,
                                                One of the problem with that approach is that the code generated from brook requires a lot of setup and mapping that the brook+ runtime needs but you would not need if you were writing in CAL. Also the code that is generated is fairly difficult to read and there is a simpler approach that I used to use here at work.
                                                The approach is this:
                                                1) Break down all math instructions into the simplest form possible. I.e. (x = cos((y * z)/4)), make it x = y * z; x = x / 4; x = cos(x)
                                                This allows for an almost 1-to-1 mapping to IL for all math instructions. If something you want to do is not possible in IL, write up a quick hlsl shader and paste it into GPU shader analyzer and see the sequence of instructions that are generated.
                                                2) For all flow control statements, break the conditional statements into simpler statements and store the results in a variable.
                                                i.e. if(x == 0 || (y * z) > w) ==> x_cond = (x == 0); y = y * z; y_cond = (y > w); if_cond = (x_cond || y_cond); if (if_cond == 0) ...
                                                This is how you would write the code in IL, and you can debug it at a higher level and verify correctness and then do a 1-1 mapping to IL instructions
                                                3) move simple if statements into cmov_logical instructions. i.e. if(conditional) a = b else a = c ==> cmov_logical a, conditional, b, c
                                                4) Only use while(1) statements, instead of easier flow control, and place a if (if_cond == 0) break or if(if_cond == 0) continue. these translate easily to break_logicalz if_cond or continue_logicalz if_cond il instructions

                                                If you follow these guidelines, you can translate any brook+ source code that has been tested and debuged into CAL/IL fairly easily with almost no major issues outside of typo's or selecting the wrong instruction.