bubu
Adept II

Maximum 1D array

Is the maximum possible size of a 1D stream just 2^23? Well, I need to allocate a 1D linear array of almost 384Mb... any idea how to perform this, pls?

 

thx

0 Likes
21 Replies
bubu
Adept II

so... it's not possible then?

0 Likes

As far as I know it's not possible in Brook+. Have you tried this in CAL?

In Brook+ the max is 8192x8192 elements.
0 Likes

Well, I prefer to use Brook+ and not CAL because I don't like the idea of working with low-level shaders. Do you know if Brook+ is going to extend the 1D arrays to, let's say, the full graphics card's VRAM? In CUDA I can allocate a really big 1D array without problems...

0 Likes

I see in the 1.2 docs that the maximum 1D array has 8192 elements ( 64M with virtualisation )... but I need to allocate 300 or 400Mb of linear data...

With CUDA I can use all the VRAM if it's needed ( using device memory pointers )... is there any way to allocate, for example, 500Mb of VRAM in Brook+, pls? If not... when is this going to be supported?

0 Likes

You can allocate a 1D array in C/C++, pass the pointer to a 2D stream in Brook+. Inside the kernel, you can reconstruct the 1D index (index = row*width+column). The maximum 2D array allowed in Brook+ is 8192x8192. We are considering increasing this limit in future releases.
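In code, that mapping might look roughly like this (only a sketch: the kernel and stream names are made up, and the exact indexof()/streamRead() syntax should be checked against the Brook+ docs for your SDK version):

// Kernel: rebuild the linear index from the 2D position of the output element.
kernel void process(float s<>, float width, out float o<>)
{
    float2 pos = indexof(o).xy;             // this element's (column, row) in the stream
    float linear = pos.y * width + pos.x;   // index = row*width + column
    o = s;                                  // use 'linear' wherever the 1D index is needed
}

// Host: a plain 1D C array copied into 2D streams; WIDTH*HEIGHT must cover
// the 1D length, with both dimensions <= 8192.
float *data = (float*)malloc(WIDTH * HEIGHT * sizeof(float));
float src<HEIGHT, WIDTH>;
float dst<HEIGHT, WIDTH>;
streamRead(src, data);
process(src, (float)WIDTH, dst);
streamWrite(dst, data);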

0 Likes

Originally posted by: udeepta@amd We are considering increasing this limit in future releases.

Yes, please... 64Mb is not enough. I just hope that won't be a hardware limit

 

For a texture 8kx8k can be enough... but, in my case, I need to store a really big tree in VRAM.

0 Likes

We hear ya.

In the meantime, will your algorithm work if you break the large linear data into a few smaller segments? The performance impact will not be very high. But I agree the limitation is something we could do without.
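For example, the host side could walk the large array in segments, roughly like this (an untested sketch; the chunk size, stream shapes and myKernel are illustrative, and handling of a final partial segment is omitted):

const unsigned int CHUNK = 8192 * 8192;        // elements per segment at the current limit
for (unsigned int offset = 0; offset < total; offset += CHUNK)
{
    float segIn<8192, 8192>;
    float segOut<8192, 8192>;
    streamRead(segIn, bigArray + offset);      // upload this segment
    myKernel(segIn, segOut);                   // run the kernel on just this segment
    streamWrite(segOut, bigArray + offset);    // copy the results back
}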

0 Likes

Originally posted by: udeepta@amd

In the meantime, will your algorithm work if you break the large linear data into a few smaller segments?

 

It's for raytracing a 15M poly model using a kd-tree on the GPU ( I need a total of 700Mb for that... which should be ok using a 1Gb card ). I could break the kernels into parts... but it's going to be too difficult, so I think I'll just wait until you remove the 64Mb limitation. I need to polish some things meanwhile.

 

Btw... another question... can the Stream SDK's arrays be virtualised like it's done with the system RAM? For example... imagine the graphics card the user is using has only 512Mb... Can the other 256Mb I need be taken from AGP/PCI memory?

thx.

0 Likes

Is the 8192x8192 a hardware limitation?
If so, is there any possibility of future compilers avoiding it? Like, instead of 8192x8192 floats, using 8192x8192 float4 but redirecting the last 2 bits to the correct position? And/or using multiple input/output streams and using the first bits to choose which?
0 Likes

The 8192x8192 limitation is a hardware one and should be different with future hardware.
0 Likes

Originally posted by: MicahVillmow The 8192x8192 limitation is a hardware one and should be different with future hardware.


Would it be possible to emulate it quickly and automatically in the driver? I really need to see all the graphics card's VRAM as a linear 1D pointer for the current HW generation... seriously.

Once this is done, I would like to see a FireStream with 4Gb, like the NVIDIA Tesla C1060. The mesh data I need to manage can occupy more than 2Gb... that, or use a virtual VRAM memory system through AGP/PCI like the CPU does.... I think Vista can virtualise the VRAM.

0 Likes

bubu,
It is possible to get access to all the VRAM inside a kernel, but this requires using CAL, IL, and the global buffer (this gives you access to 32 bits of address space). There are problems with this, but these are operating system limitations, mainly dealing with the mapping of graphics VRAM into the system PCI memory space. There is usually a limit of around 200Mb, set by the driver/OS, that can be mapped directly at any one time.

Memory management of the device's VRAM is left to the application writer, as a single policy enforced by the driver is not ideal for all workloads.
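Roughly, allocating and binding the global buffer follows the pattern in the CAL samples (a sketch; the device/ctx/module handles are assumed to come from the usual calDeviceOpen/calCtxCreate/calModuleLoad setup, and the exact signatures should be checked against the SDK headers):

// allocate a linear resource in device memory, flagged as the global buffer
CALresource res;
calResAllocLocal1D(&res, device, numElements, CAL_FORMAT_FLOAT_4,
                   CAL_RESALLOC_GLOBAL_BUFFER);

// map it into the context and bind it to the IL kernel's global buffer name "g"
CALmem  mem;
CALname name;
calCtxGetMem(&mem, ctx, res);
calModuleGetName(&name, ctx, module, "g");
calCtxSetMem(ctx, name, mem);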
0 Likes

Originally posted by: MicahVillmow bubu, It is possible to get access to all the VRAM inside a kernel, but this requires using CAL, IL, and the global buffer (this gives you access to 32 bits of address space).


 

But... can I use CAL to allocate the buffer and then pass the pointer to Brook+? I don't like the idea of programming kernels in a pseudo-assembly language. For example, this is what I want:

 

byte* dataIn  = (byte*)calAllocateBuffer(2*1024*1024);
byte* dataOut = (byte*)calAllocateBuffer(1024);

void myBrookPlusKernel ( byte *ptr )
{
    const byte dIn = ptr[1800000000];
    dataOut[threadId] = dIn+10;
}

0 Likes

It is possible to write a kernel in Brook+, convert it to IL using the Brook+ compiler (brcc), and use the generated IL in CAL. Or better yet, put the Brook+ kernel in ShaderAnalyzer and get the IL from there.

It is not as straightforward as what you asked, but it is a good middle ground where you get all CAL functionality and the ease of Brook+ kernel programming. 

0 Likes

Originally posted by: udeepta@amd It is possible to write a kernel in Brook+, convert it to IL using the Brook+ compiler (brcc), and use the generated IL in CAL. Or better yet, put the Brook+ kernel in ShaderAnalyzer and get the IL from there.


 

Ok... so imagine I need to access 256Mb of VRAM in a 512Mb Radeon.

For example, let's fill that data with a value of 0.5f ( I know, a dumb kernel ).

I should do this in Brook+ ( just pseudo code ):

void kernel ( float buff[] )
{
   buff[(blockIdx.x*blockDim.x)+threadId.x] = 0.5f;
}

Then I compile it using ShaderAnalyzer or whatever... so the AMD IL assembly is generated.

 

Then, in my C++ program I do:

// initialize CAL
...

// allocate 256Mb using CAL
void *ptr = calMalloc(256*1024*1024);

// load the compiled kernel and pass the VRAM pointer as the IL input
handle = calLoadKernel("c:\\test\\myKernel.il");
calSetInput(handle, "buff", ptr);

0 Likes

Is that the way, pls?

0 Likes

Hi Bubu,

Your point is a smart one.

Writing the kernel in Brook+ in order to get an assembly program, and then using CAL for overall memory management, seems to me the best compromise.

What is needed from AMD is a short application note (less than one page) to provide guidance on how to proceed.

Could a support guy from AMD take this as an action?

I'm sure this would be of great (and easy) help to overcome some of the current Brook limits.

Thanks

Jean-Claude

0 Likes

To use IL with CAL, you will need to do some set up and mapping, as Micah explained. The CAL samples in the SDK show examples of that.

And you can go from Brook+ to IL easily. The issue is that brcc output IL is not compact, so the mapping in CAL can get murky, and I would not go that route now.

But we are working on cleaning up the brcc output IL for the next release -- I am hoping that will help.
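For reference, the sequence those samples follow is roughly this (a hedged outline from memory; ilSource, ctx, width and height are assumed to exist from earlier setup, the compile target shown is only an example, and error checking is omitted):

// compile and link the IL source (e.g. the string taken from ShaderAnalyzer)
CALobject obj;
CALimage  image;
calclCompile(&obj, CAL_LANGUAGE_IL, ilSource, CAL_TARGET_670);
calclLink(&image, &obj, 1);

// load the module, look up the entry point and run it over a domain
CALmodule module;
CALfunc   func;
calModuleLoad(&module, ctx, image);
calModuleGetEntry(&func, ctx, module, "main");

CALdomain domain = { 0, 0, width, height };
CALevent  event;
calCtxRunProgram(&event, ctx, func, &domain);
while (calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING) ;   // wait for completion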

0 Likes

bubu,
This can be done using the C++ interface, but I'm not sure about the C interface, as I am more of a CAL engineer than a Brook+ one.
0 Likes

Jean-Claude,
One of the problems with that approach is that the code generated from Brook+ requires a lot of setup and mapping that the Brook+ runtime needs, but which you would not need if you were writing in CAL. Also, the code that is generated is fairly difficult to read, and there is a simpler approach that I used to use here at work.
The approach is this:
1) Break down all math instructions into the simplest form possible, e.g. for x = cos((y * z)/4), make it x = y * z; x = x / 4; x = cos(x).
This allows for an almost 1-to-1 mapping to IL for all math instructions. If something you want to do is not possible in IL, write up a quick HLSL shader, paste it into GPU ShaderAnalyzer, and see the sequence of instructions that are generated.
2) For all flow control statements, break the conditional statements into simpler statements and store the results in a variable,
i.e. if(x == 0 || (y * z) > w) ==> x_cond = (x == 0); y = y * z; y_cond = (y > w); if_cond = (x_cond || y_cond); if (if_cond == 0) ...
This is how you would write the code in IL, and you can debug it at a higher level, verify correctness, and then do a 1-1 mapping to IL instructions.
3) Move simple if statements into cmov_logical instructions, i.e. if(conditional) a = b else a = c ==> cmov_logical a, conditional, b, c.
4) Only use while(1) statements, instead of easier flow control, and place an if (if_cond == 0) break or if(if_cond == 0) continue. These translate easily to the break_logicalz if_cond or continue_logicalz if_cond IL instructions.

If you follow these guidelines, you can translate any Brook+ source code that has been tested and debugged into CAL/IL fairly easily, with almost no major issues beyond typos or selecting the wrong instruction. A rough sketch of this style is below.
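As a small C-level illustration of guidelines 1), 3) and 4) on the earlier example (the IL names in the comments show the intended mapping only, they are not generated output):

// original: if (x == 0 || (y * z) > w) a = b; else a = c;
float x_cond  = (x == 0.0f);            // one simple compare
y = y * z;                               // one multiply
float y_cond  = (y > w);                 // another compare
float if_cond = (x_cond || y_cond);      // combine the conditions into one flag
a = if_cond ? b : c;                     // maps to: cmov_logical a, if_cond, b, c

// guideline 4: loops written as while(1) with explicit exit conditions
while (1) {
    /* loop body */
    if (if_cond == 0) break;             // maps to: break_logicalz if_cond
}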
0 Likes

Thanks Micah,

I'll have a look, but honestly this sounds like quite a heavy process!

Jean-Claude

0 Likes