Hi Gaurav,
If my understanding is correct, there are 15 constant buffers,
which can contain up to 1024*4 elements.
The naming goes from cb0 to cb14.
So consider the following kernel
kernel void k_trial(out float A<>, float B0<>, float B1<>, float C0, float C1) {
A = B0*C0 + B1*C1;
}
To use the generated IL in CAL, I would assume the following naming for binding:
output A <=> o0
inputs B0 <=> i0, B1 <=> i1
constants C0 <=> cb0, C1 <=> cb1
That sounds fine for inputs and outputs, but the constants seem to
be both folded into cb0 through:
dcl_cb cb0[2]
Questions:
(1) so what are the variable names for C0 and C1 ?
CALname C0_name = 0;
calModuleGetName(&C0_name, ctx, module_k_trial, "???");
calCtxSetMem(ctx, C0_name, C0_Mem);
CALname C1_name = 0;
calModuleGetName(&C1_name, ctx, module_k_trial, "???");
...
(2) Does this mean that for n constants the generated IL would be dcl_cb cb0?
But then when are cb1, ..., cb14 used?
Thanks for some hints.
Jean-Claude
Hi Jean-Claude,
All the constants declared are combined into a single constant buffer, so you need to bind this data to a single constant buffer. When you allocate the data, each constant has to be 128-bit aligned (always allocate a *_4 CAL format), as CAL has 128-bit alignment requirements for constants.
calResAlloc*(, , 2, CAL_FORMAT_FLOAT_4, ); // Resource of Width 2 (for 2 constants)
calResMap(&ptr, constRes, );
ptr[0] = firstConstant;
ptr[4] = secondConstant; // Write after the 128 bits assigned to the first constant
You should bind this resource to "cb0". Hope it helps.
Hey thanks,
So just to see if my understanding is correct:
kernel void k_trial(out float A<>, float B0<>, float B1<>, float C0, float C1, float C2 ) {
A = B0*C0 + B1*C1 - C2;
}
// Allocate 4 float constants
// Note: Resource width is set to 1 (equivalent of 4 float constants) (is this safe?)
// ----------------------------------------------------------------------------------
CALresource constants_Res;
calResAllocLocal1D(&constants_Res, device, 1, CAL_FORMAT_FLOAT_4, 0);
// Set constant values
// -------------------
calResMap(&ptr, constants_Res);
ptr[0] = Constant_0;
ptr[1] = Constant_1;
ptr[2] = Constant_2;
ptr[3] = Constant_3; // won't be used
calResUnmap(constants_Res);
// Binding to ctx
// --------------
CALmem constants_Mem = 0;
calCtxGetMem(&constants_Mem, ctx, constants_Res);
// Binding to kernel constant pin
// --------------------------------------
CALname constants_name = 0;
calModuleGetName(&constants_name, ctx, module_k_trial, "cb0");
calCtxSetMem(ctx, constants_name, constants_Mem);
...
then execute kernel
Right?
The resource width should be the same as the number of constants. Also, note the indices used with the mapped pointer.
CALresource constants_Res;
calResAllocLocal1D(&constants_Res, device, 3, CAL_FORMAT_FLOAT_4, 0);
// Set constant values
// -------------------
calResMap(&ptr, constants_Res);
ptr[0] = Constant_0;
ptr[4] = Constant_1;
ptr[8] = Constant_2;
calResUnmap(constants_Res);
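The indexing above (0, 4, 8) follows from each scalar constant occupying its own 128-bit slot, i.e. four floats. A minimal sketch of that arithmetic in plain C, with no CAL calls; `set_scalar_constant` and `FLOATS_PER_SLOT` are hypothetical names used only for illustration, and `mapped` stands in for the pointer returned by calResMap viewed as float*:

```c
#include <assert.h>
#include <stddef.h>

/* Number of 32-bit floats in one 128-bit slot (one CAL_FORMAT_FLOAT_4 element). */
#define FLOATS_PER_SLOT 4

/* Hypothetical helper: store scalar constant number `index` at the start
 * of its own 128-bit slot. The other three floats of the slot are left
 * untouched. `slot_count` is the resource width used at allocation. */
static void set_scalar_constant(float *mapped, size_t slot_count,
                                size_t index, float value)
{
    assert(index < slot_count);
    mapped[index * FLOATS_PER_SLOT] = value;
}
```

With a resource of width 3, calling this helper for indices 0, 1 and 2 writes to mapped[0], mapped[4] and mapped[8].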
Got it!
So apparently there is no way to "trick" the allocator by allocating a FLOAT_4 resource and then using FLOAT_1 slices in it?
But then when you use your second approach, you get FLOAT_1 constant items, don't you? I.e.
kernel void test(float a[1024], float b[16][16], out float c<>) {...}
For constant vector a (i.e. cb0), I would then assume that the upfront CAL resource allocation would be
calResAllocLocal1D(&constant_a, device, 1024, CAL_FORMAT_FLOAT, 0);
or the equivalent to ensure alignment
calResAllocLocal1D(&constant_a, device, 1024/4, CAL_FORMAT_FLOAT_4, 0);
Unfortunately, CAL requires 128-bit allocation for constant buffers irrespective of the type of the constant array.
So you need to allocate the resource like this:
calResAllocLocal1D(&constant_a, device, 1024, CAL_FORMAT_FLOAT_4, 0);
if you are using a constant array of type float, float2 or float4 with 1024 elements.
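To make the cost of this rule concrete, here is a small sketch of the size arithmetic in plain C. The helper names are hypothetical, and the only assumption taken from the thread is that every array element occupies a full 128-bit (16-byte) slot:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes in one 128-bit constant-buffer element. */
#define SLOT_BYTES 16

/* Hypothetical helper: bytes occupied by a constant array of `elements`
 * entries when every entry takes a full 128-bit slot. */
static size_t padded_constant_bytes(size_t elements)
{
    assert(elements <= SIZE_MAX / SLOT_BYTES); /* guard against overflow */
    return elements * SLOT_BYTES;
}

/* Bytes the same number of floats would take if packed tightly. */
static size_t packed_float_bytes(size_t elements)
{
    assert(elements <= SIZE_MAX / sizeof(float));
    return elements * sizeof(float);
}
```

For a float array of 1024 elements this gives 16384 bytes padded versus 4096 bytes packed (with 4-byte floats), so a float or float2 array leaves part of every slot unused.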
OK, sorry, but I'm still getting somewhat confused.
Assume I have 64 float constants from A0 to A63
Assume my kernel should add constant A6 to input stream.
The resource allocation is done through:
calResAllocLocal1D(&constant_A, device, 64, CAL_FORMAT_FLOAT_4, 0);
which actually contains 64*4 float values
The correct kernel is then (with A being assigned to cb0)
kernel void test ( out float C<>, float A[64], float B<> ) { C = B + A[5]; }
In other words, the index in A is a 128-bit index...
So my question is what would the following kernels mean and do?
(1) kernel void test ( out float C<>, float4 A[64], float B<> ) { C = B + A[5].x; }
// and moreover the one to write A constants from the 64 first elements of a D stream
the CALoutput domain being {0,0,64,1}
(2) kernel void set_A ( out float A<>, float D[] ) {
int pos = instance().x;
A = D[pos];
}
or should it be
(3) kernel void set_A ( out float4 A<>, float D[] ) {
int pos = instance().x;
A.x = D[pos];
}
By the way, it also seems that the Brook compiler has difficulty with this:
(1) kernel void test1 ( float a[128], float b<>, out float c<>) { c = a[33] + b;}
compiles properly, no problem
but if the order of the parameters is changed...
(2) kernel void test2 ( out float c<>, float b<>, float a[128]) { c = a[33] + b;}
NOTICE: Parse error
While processing <buffer>:88
In compiler at zzerror()[parser.y:112]
message = parse error
ERROR: Parse error. Expected declaration.
While processing <buffer>:88
And this compiles properly too ...
kernel void test2 ( out float c<>, float b<>, float a[128], int d) {
c = a[33] + b + d;
}
Sounds like the Brook compiler doesn't like a fixed-size constant array being the last parameter...
Thanks for pointing out the compilation issues. I have filed a bug for it. Regarding your previous question:
Both kernels work the same way.
kernel void test ( out float C<>, float A[64], float B<> ) { C = B + A[5]; }
kernel void test ( out float C<>, float4 A[64], float B<> ) { C = B + A[5].x; }
When you call a kernel with a constant buffer, a pointer to the constant array is passed, not a stream. You need to pass a float pointer and a float4 pointer, respectively, to these kernels. The runtime will always internally allocate a float4 CAL buffer and copy the data so that each element keeps a 128-bit stride. Hope it helps in understanding.
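The copy described here can be pictured with a plain C sketch. This is not the Brook+ runtime's code, only an illustration of the layout under the assumption that each element, whatever its width, starts on a 128-bit boundary; `expand_to_float4` is a hypothetical name, and zeroing the unused floats is a choice made for this sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of 32-bit floats in one 128-bit slot. */
#define FLOATS_PER_SLOT 4

/* Hypothetical helper: copy `count` elements of `width` floats each
 * (1 = float, 2 = float2, 4 = float4) from a tightly packed source into
 * a destination where every element starts on a 128-bit boundary.
 * `dst` must hold count * 4 floats; unused floats in a slot are zeroed. */
static void expand_to_float4(float *dst, const float *src,
                             size_t count, size_t width)
{
    assert(width >= 1 && width <= FLOATS_PER_SLOT);
    memset(dst, 0, count * FLOATS_PER_SLOT * sizeof(float));
    for (size_t i = 0; i < count; ++i) {
        memcpy(dst + i * FLOATS_PER_SLOT, src + i * width,
               width * sizeof(float));
    }
}
```

After expanding a 64-element float array this way, element 5 sits at float offset 20 of the destination, which is why A[5] in the float kernel and A[5].x in the float4 kernel refer to the same location.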
Originally posted by: jean-claude
(2) is this to say that for n constants the generated IL would be dcl_cb cb0? but then when are cb1, ..., cb14 used?
You can use multiple constant buffers through Brook+ if you declare constants with their size in square brackets.
kernel void test(float a[1024], float b[16][16], out float c<>); // It will allocate two constant buffers of size 1024 and 256 (= 16x16)