24 Replies Latest reply on Aug 20, 2011 11:31 AM by hayabusaxps

    memory optimization group organization optimization

    hayabusaxps
      which is faster, between organization of the local into workgroups vs the global given there is a large program

      lets say there is a global index of 128 by 128 by 128 cube matrix, there will be about 20 calculations on each of the 128^3 =2097152 values then 1 stored value, and this process will repeated 10,000 times.

      because im working with a huge number of calculations what i need to know is, is it benifietial to organize the cube matrix into workgroups and work-items in local memory

      OR

      should i just allow opencl to handle it all in global, which losses the fast local speeds but doesnt waste time with thousands of data copies?

       

        • memory optimization group organization optimization
          maximmoroz

          You say 20 flop per one output element?

          1) How much data do you need to read from global memory for this single element?

          2) Do input data overlap for adjustant output elements? If yes then how much data do you need to read from global memory for 8 (2x2x2) output elements?

            • memory optimization group organization optimization
              hayabusaxps

              say we have the equation

               

               

              DMX=lam*((My*Bz)-(Mz*By))Mx/T

              now picture a cube of 128 elements by 128 elements by 128 elements for a total of 2097152 elements

              each element has properties of Mx,my,mz that are rewritten at each repetition and there are like 10,000 repetition i.e. 10,000 samples each sample is dependant on the previous sample.

              1) How much data do you need to read from global memory for this single element?

              1A) for the single element i will need to read a double float from global of approximately 8 values. but only 3 are dependant on the previous calculation of the element.

              2) Do input data overlap for adjustant output elements? If yes then how much data do you need to read from global memory for 8 (2x2x2) output elements?

              2A) for a global memory of 2x2x2 output elements only one variable out of 8 will overlapp. 

               

               

                • memory optimization group organization optimization
                  maximmoroz

                  So you have 128x128x128x3x10000 virtual elements V. May I ask to clearly, without any omissions, specify the equation: How do you calculate V(t+1)? For example:

                  V(i, j, k, 0, t+1) = V(i, j, k, 0, t) * V(i, j, k, 1, t) / V(j, j, k, 2, t) - V(i +1, j, k, 0, t) * W(j, j, k) * P

                  V(i, j, k, 1, t+1) = V(i, j, k, 1, t) * V(i, j, k, 2, t) / V(j, j, k, 0, t) - V(i +1, j, k, 1, t) * S(j, j, k) * P

                  V(i, j, k, 2, t+1) = V(i, j, k, 1, t) * V(i, j, k, 0, t) / V(j, j, k, 2, t) + V(i +1, j, k, 1, t) / S(j, j, k) * P

                  where:

                  i, j, k - indexes in spatial dimensions (0-127)

                  the 4th argument - index of the property (0-2)

                  the 5th argument, t - iteration

                  W, S - some constant buffers, which contains values specific to spatial dementions (i, j, k)

                  P - constant

                  Okay, I am not event asking for the exact equation. Actually, I don't need operations, I don't care (for the time being) what you do with operands. Just list all the operands.

                    • memory optimization group organization optimization
                      hayabusaxps

                      the best way i visualize this is in a set of 3 cubes of virtual elements that has 10,000 iterations utilizing the 5th arguement.

                      if we skip the maximum memory allocation barrier when storing this many cubes i would program in the first iteration with a function like.

                      int cell_position =  get_global_id(0)*get_global_size(1)* get_global_size(2)*components_vector + get_global_id(1)*get_global_size(2)*components_vector +get_global_id(2)*components_vector;

                      this would allow me to use the components_vector to specify the 4th arguement as 3 separeate cubes of virtual elements, then i could just create 2 buffers of iterations to minimize the amount of memory.

                      but back to your question 

                      1) Okay, I am not event asking for the exact equation. Actually, I don't need operations, I don't care (for the time being) what you do with operands. Just list all the operands.

                      V(i, j, k, 0, t+1) = V(i, j, k, 0, t) * V(i, j, k, 1, t) / V(j, j, k, 2, t) - V(i +1, j, k, 0, t) * W(j, j, k) * P

                      here we have operands

                      V(i, j, k, 0, t)    primary operand V, because of its size i would allocate in global or local based on the size required for memory

                      V(i, j, k, 1, t)      same

                      V(j, j, k, 2, t)      similiar except that we are using a warped set of indices for the function

                      V(i +1, j, k, 0, t)  same as above.

                      W(j, j, k)        her we have a set of constants, most likely put into private or local because it will overlap further calculations

                       P                 again similiar to w but a smart idea would be to allocate in private

                      V(i, j, k, 1, t+1) = V(i, j, k, 1, t) * V(i, j, k, 2, t) / V(j, j, k, 0, t) - V(i +1, j, k, 1, t) * S(j, j, k) * P

                      V(i, j, k, 2, t+1) = V(i, j, k, 1, t) * V(i, j, k, 0, t) / V(j, j, k, 2, t) + V(i +1, j, k, 1, t) / S(j, j, k) * P

                      the 2 above are the same idea

                       

                      BTW thank you very much for the time you are taking to help me out.

                       

                        • memory optimization group organization optimization
                          maximmoroz

                          Hmmm, I gave the formula just as an example, randomly picked up parameters, and you changed it in a very small degree. In order to understand whether it makes sense to use local memory or not we need to know what elements of buffer V are required to calculate V(i, j, k, 0, t+1), V(i, j, k, 1, t+1) and V(i, j, k, 2, t+1).

                          Is it just 3 "previous" elements V(i, j, k, 0, t), V(i, j, k, 1, t) and V(i, j, k, 2, t)? Or some adjustant elements of buffer V?

                            • memory optimization group organization optimization
                              hayabusaxps

                              the way my program is designed right now there are something like

                              128x128x128x13x iteration

                                  /* INPUTS POSITION
                                   *
                                   * M0   0
                                   * T1   1
                                   * T2   2
                                   * lam  3
                                   * X    ,  Y    ,  Z            {4,5,6}
                                   * Mx0  ,  My0  ,  Mz0          {7,8,9}
                                   * Bx0  ,  By0  ,  Bz0          {10,11,12}
                                   */

                               

                              instead of 3 in the global,

                              what im thinking is

                              local for

                              * M0 0
                              * T1 1
                              * T2 2
                              * lam 3

                              and global for

                              * Mx0 , My0 , Mz0 {7,8,9}
                              * Bx0 , By0 , Bz0 {10,11,12}

                               

                              because i am using the M(i,j,k,x,y,z) ~ nuclear magnitization vector of the hydrogen molecule and then B for the magnetic field. after each iteration the M will change based on the B(i,j,k,x,y,z).

                              but each i,k,j cell will have overlapping constants like the T1,T2, and gyromagneticRatio. although i want each individual cell to be independant so there not really constants be not all cells are identical. which is exactly like your W,S constant buffers.

                               

                               

                              lamens terms what trying to convey to me is that only to put values that are overlapping in the iteration and between the cells(i,j,k) (spatial) in local? is that correct.

                               

                                • memory optimization group organization optimization
                                  hayabusaxps

                                  also the

                                  M_initial(iteration)  => calculation => M_final(iteration +1)

                                  M_initial(iteration+1) => calculation => M_final(iteration +2)

                                   

                                  • memory optimization group organization optimization
                                    maximmoroz

                                    Ok, I looked up what is NMR is in Wikipedia. But I am not physics in any way.

                                    So M(x,y,z), magnetization vector for the specific point (i,j,k) for the iteration (t+1) is calculated basing on the value of this vector for the same point in previous iteration and basing on the value of external magnetic field B at this point (i,j,k) and parameters T1,T2, and gyromagneticRatio, again at this point (i,j,k) (these parameters don't change through each iterations).

                                    But it would be too easy. I am certainly missing something here.

                                    Maybe there is another dynamic input here, for example M also depends on the values of this M of the previous operation of adjustant points? Check: would equations remain the same if we have 1D 2,097,152 elements instead of 3D 128x128x128?

                                    Or B is changing too each iteration? Check: How B differs from T1, T2 and gyrRatio from calculation point of view, for example?

                                      • memory optimization group organization optimization
                                        hayabusaxps

                                        well in a mri there are usually millions of these little data points all being processed in parralel well more like the summation of the hydrogen atoms magnetic vectors (bc thier spinning) create a signal that we can measure

                                        1)values of this M of the previous operation of adjustant points

                                        the concept of the mri is fairly basic its the pulse sequence required to create an echo with usable information that can get tricky, yes the molecule -molecule dephasing isnt present to create distortion but wikipedias idea should give you a good idea of which elements are being used.

                                        2) would equations remain the same if we have 1D 2,097,152 elements instead of 3D 128x128x128?

                                        yes the equations would not change because they are still based on a Xdimension -position and 3D-vector  , the only difference is that they would not be getting a reading for a 3d space just a line.

                                         

                                        3)Or B is changing too each iteration?

                                        yes, each individual (i,j,k) element in M(i,j,k) has a slightly different B, this is what creates the echo, when used at a specific oscillation (larmar freq) thus we also have a changing B on each iteration.

                                          • memory optimization group organization optimization
                                            hayabusaxps

                                            also the automatic workgrouping changes i believe when the card processes  3 dimensional vs 1 dimensional.

                                            i believe i saw a huge increase in utilization when using a 3d calculation space over a 1 dimensional space, but really it shouldnt matter as long as your repeating the kernal with different input positions.

                                            although you have a point there because when you encode the stream into a cube it requires more processing.

                                              • memory optimization group organization optimization
                                                maximmoroz

                                                > although you have a point there because when you encode the stream into a cube it requires more processing

                                                Well, no. I asked about 1D possibility just to ensure that M(i,j,k) are independent on one another.

                                                > each individual (i,j,k) element in M(i,j,k) has a slightly different B, this is what creates the echo, when used at a specific oscillation (larmar freq) thus we also have a changing B on each iteration

                                                How B(i,j,k) is calculated at each step?

                                                  • memory optimization group organization optimization
                                                    hayabusaxps

                                                    something like this

                                                    3.1) B(r,t) = ((B0 + G(r,t)) * k) + (B1(t) * cos(localFrequency * t) * i) + (B1(t) * sin(localFrequency * t) * j);

                                                    Where,

                                                    • B(r,t) is the magneticField vector generated by the MRI equipment(in units of Tesla)
                                                    • B0 is the static magnetic field
                                                    • G(r,t) is the Gradient field
                                                    • B1(t) is the RF field
                                                    • i, j, k are the unit vectors in the x, y, z directions respectively.
                                                    • localFrequency: set to the larmorFrequency (see equation 3.2) of the desired slice (figure 3.2, table 3.2)
                                                    • r: position (in 3-D space)
                                                    • t: time

                                                    3.2) larmorFrequency = gyromagneticRatio * B0. (example: if B0 = 1.5 tesla, the larmorFrequency = 63.855 MHz);

                                                      • memory optimization group organization optimization
                                                        maximmoroz

                                                        Is there dependency from M(x,y,z) here? Through G(r,t) or B1(t)?

                                                          • memory optimization group organization optimization
                                                            hayabusaxps

                                                            yes, M(i,j,k,t) is dependant on B(i,j,k,t)   which is comprised of B0(i,j,k,t) , G(i,j,k,t) , and B1(i,j,k,t).

                                                              • memory optimization group organization optimization
                                                                maximmoroz

                                                                *crying*

                                                                Do you have in ANY of these numerous equations the one which introduces dependency between different points (i,j,k)? If there is no such equation then it is possible to traverse all 10,000 iterations for some point (i,j,k) without doing any caclulations for points located at different spatial dimentions, right?

                                                                  • memory optimization group organization optimization
                                                                    hayabusaxps

                                                                    correct, im hoping that i can lump say 100 iterations in the kernal because the spatial dimensions are independant, which will reduce the interval between cpu activation times which was slowing my previous code down.

                                                                      • memory optimization group organization optimization
                                                                        maximmoroz

                                                                        Why 100? Launch all 10,000. Actually, your task just perfectly fits GPU.

                                                                        1) Load all initial values (M, B), all coefficients from global memory to private. Set index to 0.

                                                                        2) Calculate M(t+1) and B(t+1) basing on current M and B located in private memory.

                                                                        3) Set current M and B to those newly calculated M(t+1) and B(t+1)

                                                                        4) Increment index. Go to step 2 if iteration index is less than 10,000

                                                                        5) Write M and B to global memory.

                                                                        So you access global memory only at steps 1 and 5. All intermediate calculations are done with values in private memory (registers). That should be amazingly fast if you stick to basic ALU optimization technics.

                                                                          • memory optimization group organization optimization
                                                                            hayabusaxps

                                                                            speed changed from cpu = 30 min to gpu =30 seconds.

                                                                            yea thats what i was thinking ill try and get an overview of exactly what im planning would you for the private and local and global, so far ive been seeing some really insane speeds but i failed it run  everything in the gpu so i was causing cpu spikes at the kernal was iterating. 

                                                                            another thing is im working with double floating point numbers but when i converted from float to double float i started getting values like -inf or infinity am i using the correct printf or do you have any suggestions.

                                                                              • memory optimization group organization optimization
                                                                                hayabusaxps

                                                                                btw thankyou verry much for helping me out i dont have anybody around me that even knows of gpu coding or gpgpu coding so its been a huge help to actually have someone to coverse with lol.

                                                                                saturday and sunday ill be out of town but ill gather up a few questions for you on monday so then when ever your back online you will be able to see them.

                                                                                 

                                                                                Thank you very much.

                                                                                • memory optimization group organization optimization
                                                                                  maximmoroz

                                                                                  As you described the problem you don't need local memory as far as you need neither synchronization between work-items nor global memory access reduction. Execution speed should be near theoretical GPU perfromance (well, about 30-70% depending on your exact formulas and optimizations applied).

                                                                                  I didn't use double precision calculations, so I cannot give you many advices here. The only thing worth mentioining is that APP AMD SDK 2.5 replaced cl_amd_fp64 extention with cl_khr_fp64 for 58xx devices.

                                                                                  Well, I don't have peers to discuss GPU computing also. Forum discussions help a lot, ahaha. But if you want to efficiently progress in GPU Computing field you need to self-educate. In my particular case it is programming guides that helped me a lot. The ones produced by GPU manufacturers. You know they have a good motivation to produce high quality programming guides. AMD wrote (and keeps up to date) especially good one: AMD APP OpenCL Programming Guide.

                                                                                  Sure, Ask your questions, I will try to answer what I can.

                                                                                    • memory optimization group organization optimization
                                                                                      hayabusaxps

                                                                                      concept here is to say
                                                                                      stream into global the basic 3dimensional input.
                                                                                      transfer the   
                                                                                        cell properties into     mapped to  constant memory elements  8kB/CU  4-16b/cycle
                                                                                        cell magnetic calculations   mapped to  local    32kB/CU  8b/cycle
                                                                                        cell magnetic momentum vector  mapped to  private    256kB/CU 48b/cycle

                                                                                        global stream holds all input in a 3D cube with each cell housing all elements
                                                                                        -input  stream  = all elements
                                                                                        -output stream  = time dependant summation of magnetic momentum vectors.

                                                                                       


                                                                                      because of the sizing constraints i would like to run say 1000 iterations at a time because the smallest memory is local 8B per double element so 1000 iterations will leave me a max of 32 allocations

                                                                                      and 32 local
                                                                                      and 256 private


                                                                                      process
                                                                                      stream in virtual space from host.
                                                                                      load 20 cells characteristics into memory locations
                                                                                      calculate the magnetic fields sectioned by location
                                                                                      calculate the magnetic moment vector sectioned by location
                                                                                      1000 iterations for 20 cells

                                                                                      save magnetic vectors in global
                                                                                      load magnetic vectors     sectioned by iteration so that we can summation.
                                                                                      calculate the summation

                                                                                      load back into global output stream.

                                                                                      load next 1000 iterations and repeat entire process so that we dont waste memory.


                                                                                      loop untill 10M samples has been processed.


                                                                                      send to host memory.

                                                                                        • memory optimization group organization optimization
                                                                                          hayabusaxps

                                                                                          my last question is how do i print a double floating point in c++

                                                                                           

                                                                                          also allocating lds and private and constants just require me to use

                                                                                           

                                                                                          __private

                                                                                          __local

                                                                                          const

                                                                                          correct?

                                                                                          • memory optimization group organization optimization
                                                                                            maximmoroz

                                                                                            > because of the sizing constraints i would like to run say 1000 iterations at a time because the smallest memory is local 8B per double element so 1000 iterations will leave me a max of 32 allocations

                                                                                            I don't understant why do you need additional local memory for each iteration as far as for calculating values for iteration t+1 you need values from iteration t and don't need values from earlier iterations (t-1, t-2 e t.c.).

                                                                                            Did you read OpenCL Programming Guide? That's what books are written for - one write the book, a lot of others read it.

                                                                                              • memory optimization group organization optimization
                                                                                                hayabusaxps

                                                                                                Did you read OpenCL Programming Guide?

                                                                                                Yes, i did and i agree with the 1000 iterations at a time because then if i utilize both the local and private i should achieve better bandwidth, after viewing the local and private max memory it seems that the fastest method would be to shrink the size to the number of CUs and run a larger number of iterations, then once the 1000 ietations has been achieved if i then move it back into global and load new values from global i should be able to minimize the cpu activity.

                                                                                                actually im going to store all of the iterations so that i can achieve a echo.