
memory optimization group organization optimization
maximmoroz Aug 9, 2011 9:46 PM (in response to hayabusaxps)
You say 20 FLOPs per one output element?
1) How much data do you need to read from global memory for this single element?
2) Do input data overlap for adjacent output elements? If yes, then how much data do you need to read from global memory for 8 (2x2x2) output elements?

hayabusaxps Aug 10, 2011 3:51 PM (in response to maximmoroz)
Say we have the equation
dMx = lam*((My*Bz) - (Mz*By)) - Mx/T
now picture a cube of 128 elements by 128 elements by 128 elements for a total of 2097152 elements
each element has properties Mx, My, Mz that are rewritten at each repetition, and there are like 10,000 repetitions, i.e. 10,000 samples; each sample is dependent on the previous sample.
1) How much data do you need to read from global memory for this single element?
1A) for the single element I will need to read approximately 8 double-float values from global, but only 3 are dependent on the previous calculation of the element.
2) Do input data overlap for adjacent output elements? If yes, then how much data do you need to read from global memory for 8 (2x2x2) output elements?
2A) for a 2x2x2 block of output elements, only one variable out of 8 will overlap.
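The per-element update discussed above can be sketched in plain C. This assumes the garbled equation was meant as dMx = lam*((My*Bz) - (Mz*By)) - Mx/T (the minus signs are my reconstruction; it matches the x component of a Bloch-style equation, but confirm against your actual formula):

```c
/* One update step for a single element's x component.
 * Hypothetical reconstruction of the thread's equation:
 *   dMx = lam*((My*Bz) - (Mz*By)) - Mx/T
 * lam and T are per-element constants; M and B are the current vectors. */
static double dmx_step(double lam, double T,
                       double Mx, double My, double Mz,
                       double By, double Bz)
{
    return lam * ((My * Bz) - (Mz * By)) - Mx / T;
}
```

The same shape applies to the y and z components with the indices rotated.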

maximmoroz Aug 10, 2011 4:34 PM (in response to hayabusaxps)
So you have 128x128x128x3x10000 virtual elements V. May I ask you to clearly, without any omissions, specify the equation: How do you calculate V(t+1)? For example:
V(i, j, k, 0, t+1) = V(i, j, k, 0, t) * V(i, j, k, 1, t) / V(j, j, k, 2, t) - V(i+1, j, k, 0, t) * W(j, j, k) * P
V(i, j, k, 1, t+1) = V(i, j, k, 1, t) * V(i, j, k, 2, t) / V(j, j, k, 0, t) - V(i+1, j, k, 1, t) * S(j, j, k) * P
V(i, j, k, 2, t+1) = V(i, j, k, 1, t) * V(i, j, k, 0, t) / V(j, j, k, 2, t) + V(i+1, j, k, 1, t) / S(j, j, k) * P
where:
i, j, k - indexes in the spatial dimensions (0-127)
the 4th argument - index of the property (0-2)
the 5th argument, t - iteration
W, S - some constant buffers, which contain values specific to the spatial positions (i, j, k)
P - constant
Okay, I am not even asking for the exact equation. Actually, I don't need operations, I don't care (for the time being) what you do with operands. Just list all the operands.

hayabusaxps Aug 11, 2011 3:12 PM (in response to maximmoroz)
The best way I visualize this is as a set of 3 cubes of virtual elements that has 10,000 iterations utilizing the 5th argument.
If we skip the maximum memory allocation barrier when storing this many cubes, I would program the first iteration with a function like:
int cell_position = get_global_id(0) * get_global_size(1) * get_global_size(2) * components_vector
                  + get_global_id(1) * get_global_size(2) * components_vector
                  + get_global_id(2) * components_vector;
this would allow me to use components_vector to specify the 4th argument as 3 separate cubes of virtual elements; then I could just create 2 buffers of iterations to minimize the amount of memory.
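The indexing line above can be checked with a plain-C equivalent. Here n1 and n2 are stand-ins for get_global_size(1) and get_global_size(2), and the Horner-style form is algebraically identical to the kernel's sum of three products:

```c
/* Flattens (i, j, k) plus a per-cell component count into one
 * linear offset, matching:
 *   i*n1*n2*c + j*n2*c + k*c
 * n1, n2 stand in for get_global_size(1) and get_global_size(2). */
static int cell_position(int i, int j, int k,
                         int n1, int n2, int components_vector)
{
    return ((i * n1 + j) * n2 + k) * components_vector;
}
```

For a 128x128x128 grid with 3 components, cell (1,0,0) lands at offset 128*128*3.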
but back to your question
1) Okay, I am not even asking for the exact equation. Actually, I don't need operations, I don't care (for the time being) what you do with operands. Just list all the operands.
V(i, j, k, 0, t+1) = V(i, j, k, 0, t) * V(i, j, k, 1, t) / V(j, j, k, 2, t) - V(i+1, j, k, 0, t) * W(j, j, k) * P
here we have operands
V(i, j, k, 0, t) - primary operand V; because of its size I would allocate it in global or local based on the size required for memory
V(i, j, k, 1, t) - same
V(j, j, k, 2, t) - similar, except that we are using a warped set of indices for the function
V(i+1, j, k, 0, t) - same as above
W(j, j, k) - here we have a set of constants, most likely put into private or local because it will overlap further calculations
P - again similar to W, but a smart idea would be to allocate it in private
V(i, j, k, 1, t+1) = V(i, j, k, 1, t) * V(i, j, k, 2, t) / V(j, j, k, 0, t) - V(i+1, j, k, 1, t) * S(j, j, k) * P
V(i, j, k, 2, t+1) = V(i, j, k, 1, t) * V(i, j, k, 0, t) / V(j, j, k, 2, t) + V(i+1, j, k, 1, t) / S(j, j, k) * P
the 2 above are the same idea
BTW thank you very much for the time you are taking to help me out.

maximmoroz Aug 11, 2011 3:32 PM (in response to hayabusaxps)
Hmmm, I gave the formula just as an example with randomly picked parameters, and you changed it only to a very small degree. In order to understand whether it makes sense to use local memory or not, we need to know which elements of buffer V are required to calculate V(i, j, k, 0, t+1), V(i, j, k, 1, t+1) and V(i, j, k, 2, t+1).
Is it just the 3 "previous" elements V(i, j, k, 0, t), V(i, j, k, 1, t) and V(i, j, k, 2, t)? Or some adjacent elements of buffer V?

hayabusaxps Aug 11, 2011 6:04 PM (in response to maximmoroz)
The way my program is designed right now, there are something like
128x128x128 x 13 x iterations
/* INPUTS POSITION
*
* M0 0
* T1 1
* T2 2
* lam 3
* X , Y , Z {4,5,6}
* Mx0 , My0 , Mz0 {7,8,9}
* Bx0 , By0 , Bz0 {10,11,12}
*/
instead of 3 in the global,
what im thinking is
local for
* M0 0
* T1 1
* T2 2
* lam 3
and global for
* Mx0 , My0 , Mz0 {7,8,9}
* Bx0 , By0 , Bz0 {10,11,12}
because I am using M(i,j,k,x,y,z) ~ the nuclear magnetization vector of the hydrogen molecule, and then B for the magnetic field. After each iteration the M will change based on the B(i,j,k,x,y,z).
But each i,j,k cell will have overlapping constants like T1, T2, and the gyromagneticRatio. Although I want each individual cell to be independent, so they're not really constants, because not all cells are identical. Which is exactly like your W, S constant buffers.
In layman's terms, what you're trying to convey to me is to put only the values that overlap across iterations and between the spatial cells (i,j,k) in local? Is that correct?

hayabusaxps Aug 11, 2011 6:07 PM (in response to hayabusaxps)
Also:
M_initial(iteration) => calculation => M_final(iteration +1)
M_initial(iteration+1) => calculation => M_final(iteration +2)

maximmoroz Aug 11, 2011 6:39 PM (in response to hayabusaxps)
Ok, I looked up what NMR is on Wikipedia. But I am not a physicist in any way.
So M(x,y,z), the magnetization vector for the specific point (i,j,k) for iteration (t+1), is calculated based on the value of this vector for the same point in the previous iteration, and based on the value of the external magnetic field B at this point (i,j,k) and the parameters T1, T2, and gyromagneticRatio, again at this point (i,j,k) (these parameters don't change through the iterations).
But that would be too easy. I am certainly missing something here.
Maybe there is another dynamic input here; for example, M also depends on the values of this M from the previous operation at adjacent points? Check: would the equations remain the same if we had 1D 2,097,152 elements instead of 3D 128x128x128?
Or is B changing each iteration too? Check: How does B differ from T1, T2 and gyrRatio from a calculation point of view, for example?

hayabusaxps Aug 12, 2011 12:02 PM (in response to maximmoroz)
Well, in an MRI there are usually millions of these little data points all being processed in parallel; more precisely, the summation of the hydrogen atoms' magnetic vectors (because they're spinning) creates a signal that we can measure.
1) values of this M of the previous operation of adjacent points
The concept of the MRI is fairly basic; it's the pulse sequence required to create an echo with usable information that can get tricky. Yes, the molecule-to-molecule dephasing isn't present to create distortion, but Wikipedia's idea should give you a good idea of which elements are being used.
2) would the equations remain the same if we had 1D 2,097,152 elements instead of 3D 128x128x128?
Yes, the equations would not change, because they are still based on an X-dimension position and a 3D vector; the only difference is that they would not be getting a reading for a 3D space, just a line.
3) Or is B changing each iteration too?
Yes, each individual (i,j,k) element in M(i,j,k) has a slightly different B; this is what creates the echo when used at a specific oscillation (Larmor freq), thus we also have a changing B on each iteration.

hayabusaxps Aug 12, 2011 12:06 PM (in response to hayabusaxps)
Also, the automatic workgrouping changes, I believe, when the card processes 3-dimensional vs 1-dimensional.
I believe I saw a huge increase in utilization when using a 3D calculation space over a 1-dimensional space, but really it shouldn't matter as long as you're repeating the kernel with different input positions.
Although you have a point there, because when you encode the stream into a cube it requires more processing.

maximmoroz Aug 12, 2011 12:28 PM (in response to hayabusaxps)
> although you have a point there, because when you encode the stream into a cube it requires more processing
Well, no. I asked about the 1D possibility just to ensure that the M(i,j,k) are independent of one another.
> each individual (i,j,k) element in M(i,j,k) has a slightly different B; this is what creates the echo when used at a specific oscillation (Larmor freq), thus we also have a changing B on each iteration
How is B(i,j,k) calculated at each step?

hayabusaxps Aug 12, 2011 12:50 PM (in response to maximmoroz)
Something like this:
3.1) B(r,t) = ((B0 + G(r,t)) * k) + (B1(t) * cos(localFrequency * t) * i) + (B1(t) * sin(localFrequency * t) * j);
Where,
- B(r,t) is the magneticField vector generated by the MRI equipment (in units of Tesla)
- B0 is the static magnetic field
- G(r,t) is the Gradient field
- B1(t) is the RF field
- i, j, k are the unit vectors in the x, y, z directions respectively.
- localFrequency: set to the larmorFrequency (see equation 3.2) of the desired slice (figure 3.2, table 3.2)
- r: position (in 3D space)
- t: time
3.2) larmorFrequency = gyromagneticRatio * B0. (example: if B0 = 1.5 tesla, the larmorFrequency = 63.855 MHz);

maximmoroz Aug 12, 2011 1:17 PM (in response to hayabusaxps)
Is there a dependency on M(x,y,z) here? Through G(r,t) or B1(t)?

hayabusaxps Aug 12, 2011 1:59 PM (in response to maximmoroz)
Yes, M(i,j,k,t) is dependent on B(i,j,k,t), which is comprised of B0(i,j,k,t), G(i,j,k,t), and B1(i,j,k,t).

maximmoroz Aug 12, 2011 2:28 PM (in response to hayabusaxps)
*crying*
Do you have, in ANY of these numerous equations, one which introduces a dependency between different points (i,j,k)? If there is no such equation, then it is possible to traverse all 10,000 iterations for some point (i,j,k) without doing any calculations for points located at different spatial positions, right?

hayabusaxps Aug 12, 2011 5:16 PM (in response to maximmoroz)
Correct. I'm hoping that I can lump, say, 100 iterations into the kernel, because the spatial dimensions are independent, which will reduce the interval between CPU activation times, which was slowing my previous code down.

maximmoroz Aug 12, 2011 5:35 PM (in response to hayabusaxps)
Why 100? Launch all 10,000. Actually, your task just perfectly fits the GPU.
1) Load all initial values (M, B) and all coefficients from global memory to private. Set the iteration index to 0.
2) Calculate M(t+1) and B(t+1) based on the current M and B located in private memory.
3) Set the current M and B to those newly calculated M(t+1) and B(t+1).
4) Increment the index. Go to step 2 if the iteration index is less than 10,000.
5) Write M and B to global memory.
So you access global memory only at steps 1 and 5. All intermediate calculations are done with values in private memory (registers). That should be amazingly fast if you stick to basic ALU optimization techniques.
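The five steps above can be sketched in plain C; in the kernel version the local variables would live in private memory (registers). Here step() is a hypothetical placeholder for the actual M/B update equations:

```c
#define ITERATIONS 10000

/* All six per-cell values kept together; in a kernel these would be
 * private variables. */
typedef struct { double Mx, My, Mz, Bx, By, Bz; } state;

/* Placeholder for the real M/B update equations (Bloch step, B field
 * recalculation, etc.). Identity here, just to show the structure. */
static state step(state s)
{
    return s;
}

static void simulate_cell(const state *global_in, state *global_out)
{
    state s = *global_in;              /* step 1: one global read    */
    for (int t = 0; t < ITERATIONS; ++t)
        s = step(s);                   /* steps 2-4: private only    */
    *global_out = s;                   /* step 5: one global write   */
}
```

The key property is that global memory is touched exactly twice per cell, regardless of the iteration count.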

hayabusaxps Aug 12, 2011 10:24 PM (in response to maximmoroz)
Speed changed from CPU = 30 min to GPU = 30 seconds.
Yeah, that's what I was thinking. I'll try and get an overview of exactly what I'm planning for the private, local, and global. So far I've been seeing some really insane speeds, but I failed to run everything on the GPU, so I was causing CPU spikes while the kernel was iterating.
Another thing is I'm working with double floating point numbers, but when I converted from float to double float I started getting values like inf or infinity. Am I using the correct printf, or do you have any suggestions?

hayabusaxps Aug 12, 2011 10:26 PM (in response to hayabusaxps)
BTW, thank you very much for helping me out. I don't have anybody around me that even knows of GPU coding or GPGPU coding, so it's been a huge help to actually have someone to converse with, lol.
Saturday and Sunday I'll be out of town, but I'll gather up a few questions for you on Monday, so whenever you're back online you will be able to see them.
Thank you very much.

maximmoroz Aug 13, 2011 4:08 AM (in response to hayabusaxps)
As you described the problem, you don't need local memory, as you need neither synchronization between work-items nor global memory access reduction. Execution speed should be near theoretical GPU performance (well, about 30-70% depending on your exact formulas and the optimizations applied).
I didn't use double precision calculations, so I cannot give you much advice here. The only thing worth mentioning is that AMD APP SDK 2.5 replaced the cl_amd_fp64 extension with cl_khr_fp64 for 58xx devices.
Well, I don't have peers to discuss GPU computing with either. Forum discussions help a lot, ahaha. But if you want to efficiently progress in the GPU computing field you need to self-educate. In my particular case it is programming guides that helped me a lot, the ones produced by GPU manufacturers. You know, they have good motivation to produce high quality programming guides. AMD wrote (and keeps up to date) an especially good one: the AMD APP OpenCL Programming Guide.
Sure, ask your questions, I will try to answer what I can.

hayabusaxps Aug 16, 2011 9:42 PM (in response to maximmoroz)
The concept here is to:
stream into global the basic 3-dimensional input.
transfer the
cell properties mapped to constant memory elements 8kB/CU 416b/cycle
cell magnetic calculations mapped to local 32kB/CU 8b/cycle
cell magnetic momentum vector mapped to private 256kB/CU 48b/cycle
the global stream holds all input in a 3D cube, with each cell housing all elements
input stream = all elements
output stream = time-dependent summation of magnetic momentum vectors.
Because of the sizing constraints I would like to run, say, 1000 iterations at a time, because the smallest memory is local: 8B per double element, so 1000 iterations will leave me a max of 32 allocations
and 32 local
and 256 private
Process:
stream in the virtual space from the host.
load 20 cells' characteristics into memory locations
calculate the magnetic fields sectioned by location
calculate the magnetic moment vector sectioned by location
1000 iterations for 20 cells
save magnetic vectors in global
load magnetic vectors sectioned by iteration so that we can do the summation.
calculate the summation
load back into the global output stream.
load the next 1000 iterations and repeat the entire process so that we don't waste memory.
loop until 10M samples have been processed.
send to host memory.
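The "summation of magnetic momentum vectors" output step named above can be illustrated with a minimal plain-C reduction over one component; the function name and layout (one contiguous array per component per iteration) are my assumptions, not the thread's actual code:

```c
/* Sums one component of the magnetic moment vectors over n cells,
 * producing one sample of the measurable signal for an iteration.
 * Hypothetical helper; a kernel version would be a parallel reduction. */
static double sum_component(const double *M, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += M[i];
    return s;
}
```

Running this once per stored iteration gives the time-dependent signal (the echo) that the output stream holds.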
hayabusaxps Aug 16, 2011 9:43 PM (in response to hayabusaxps)
My last question is: how do I print a double floating point in C++?
Also, allocating LDS and private and constants just requires me to use
__private
__local
const
correct?

maximmoroz Aug 17, 2011 8:51 AM (in response to hayabusaxps)
> because of the sizing constraints i would like to run say 1000 iterations at a time because the smallest memory is local 8B per double element so 1000 iterations will leave me a max of 32 allocations
I don't understand why you need additional local memory for each iteration, as for calculating values for iteration t+1 you need values from iteration t and don't need values from earlier iterations (t-1, t-2, etc.).
Did you read the OpenCL Programming Guide? That's what books are written for: one writes the book, a lot of others read it.

hayabusaxps Aug 20, 2011 11:31 AM (in response to maximmoroz)
> Did you read the OpenCL Programming Guide?
Yes, I did, and I agree with the 1000 iterations at a time, because then if I utilize both the local and private I should achieve better bandwidth. After viewing the local and private max memory, it seems that the fastest method would be to shrink the size to the number of CUs and run a larger number of iterations. Then, once the 1000 iterations have been achieved, if I move it back into global and load new values from global, I should be able to minimize the CPU activity.
Actually, I'm going to store all of the iterations so that I can achieve an echo.
