Hi,
I have run into a small problem. My kernel function does not work correctly and returns only a partly correct result. The function is very simple, and I believe the mistake is not mine. Could you please look at my example? What's wrong?
kernel void func1(unsigned char src[][], unsigned char str, out unsigned char o_img<>)
{
// Output position
int j = instance().x; // width
int i = instance().y; // height
int rest = j % 16;
if (rest == 0)
{
o_img = src[i][j];
}
else
{
o_img = str;
}
}
Input:
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1
I just replace each 1 with 9.
Wrong output:
3 9 9 9 9 9 9 9 9 9 9 9 9 9 9 3 9 9 9 9 9 9 9 9 9 9 9 9 9 9 3 9 9 9 9 9 9 9
You can see garbage in the output. If I remove the if statement from the kernel, the output contains no garbage. But I need the version with the if statement.
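For comparison, here is a minimal CPU-side sketch in plain C of what the kernel is expected to produce for one row of the image. The function name func1_reference and the row-wise layout are assumptions made for illustration; they are not part of Brook+.

```c
#include <stddef.h>

/* Hypothetical CPU reference for func1: keep every 16th element of the
   row and replace every other element with the constant str. */
void func1_reference(const unsigned char *src, unsigned char str,
                     unsigned char *out, size_t width)
{
    for (size_t j = 0; j < width; j++) {
        out[j] = (j % 16 == 0) ? src[j] : str;
    }
}
```

With the input above and str = 9, every group of 16 elements should come out as one 3 followed by fifteen 9s, which makes the missing values in the GPU output easy to spot.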
The type is double.
I use MSVC. I will try your advice.
Do you mean like this?
mkdir brookgenfiles | "$(BROOKROOT)\sdk\bin\brcc_d.exe" -p cal -o "$(ProjectDir)\brookgenfiles\$(InputName)" "$(InputPath)"
It helps. Thanks.
Can I install the latest ATI drivers? I mean, did you fix the problem described at the beginning of the thread (memory garbage)?
Thanks
An additional remark regarding the Brook compiler.
The expression:
double sad = (double)(abs(((src[idx]) - (ref[idx])))), where src and ref are unsigned char[], causes overflow.
The correct variant is:
double sad = (double)(abs((((int)src[idx]) - ((int)ref[idx]))))
But I believe the compiler should convert this to an integer operation automatically.
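To illustrate the effect in plain C, here is a small sketch of what happens when the difference of two bytes is kept in 8 bits compared with widening both operands first. That brcc performs the subtraction in the unsigned char type is an assumption based on the behaviour reported above; standard C promotes the operands to int by itself, so the first function forces the 8-bit result explicitly. Both function names are hypothetical.

```c
#include <stdlib.h>

/* Difference kept in 8 bits: 10 - 20 wraps around to 246. */
int abs_diff_wrapped(unsigned char a, unsigned char b)
{
    unsigned char d = (unsigned char)(a - b);  /* wraps modulo 256 */
    return abs((int)d);
}

/* Operands widened to int first: 10 - 20 is -10, and abs gives 10. */
int abs_diff_widened(unsigned char a, unsigned char b)
{
    return abs((int)a - (int)b);
}
```

The two functions agree whenever a >= b and differ whenever a < b, which is why a SAD built from the wrapped variant looks partly right.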
An additional question.
Can I pass more than one output buffer?
As I understand it, the output buffer defines the domain of execution, so a kernel can use only one output stream. Is that right?
I need an additional array of the same size as the output stream. Like this:
kernel void motion_estimation(unsigned char src[],
unsigned char ref[],
int width,
int height,
int mv[][], // additional buffer
out double sad[][])
You can use multiple regular output streams, but multiple scatter streams are not supported.
Could you give me an example, please? Does it affect performance?
kernel void multiple_output(out float o0<>, out float4 o1<>) // valid; good for performance, as I would expect it to increase the compute intensity of the kernel compared to calling two kernels with single output streams
kernel void multiple_scatter(out float o0[], out float4 o1[]) // not supported
kernel void mix_output(out float o0[], out float4 o1<>) // supported, but the computation is done in multiple passes, so performance is similar to calling two kernels with single output streams
Ok. Thanks.
Are these two chunks of code equivalent?
kernel void motion_estimation(unsigned char src[],
unsigned char ref[],
int width,
int height,
out double sad[][])
{
// Output position
int2 vPos = instance().xy;
int i = vPos.x; // width
int j = vPos.y; // height
if (i % 16 == 0 && j % 16 == 0)
sad[j][i] = 1.0;
}
and
kernel void motion_estimation(unsigned char src[],
unsigned char ref[],
int width,
int height,
out double sad<>)
{
// Output position
int2 vPos = instance().xy;
int i = vPos.x; // width
int j = vPos.y; // height
if (i % 16 == 0 && j % 16 == 0)
sad = 1.0;
}
The next question.
The -p cal option helps to avoid the template compile errors, but unfortunately it makes the program hard to debug. I mean that the values returned in the out stream are corrupted when I compile the program with -p cal. As soon as I remove -p cal and rebuild the project, this problem disappears, but the template errors return.
What do you advise?
Yes, the two kernels above are equivalent, and the second one would have much better performance. Scatter streams are meant for random writes; if you always write to the instance() position, it is better to use a regular output stream.
-p cal disables CPU backend code generation, so as long as you are not running your code in CPU emulation mode, everything should be fine. Make sure you have not set the environment variable BRT_RUNTIME=cpu.
Gaurav,
I wonder, what is the advantage of using the CPU emulator? I understand that the instructions are executed on the CPU. And then what? In any case, I cannot step into the kernel and debug inside it.
The CPU backend code exists for debugging only. You can debug inside the kernel if you disable line generation in the cpp file (use the -nl option).
Gaurav,
BRT_RUNTIME = cal
mkdir brookgenfiles | "$(BROOKROOT)\sdk\bin\brcc_d.exe" -p cal -o "$(ProjectDir)\brookgenfiles\$(InputName)" "$(InputPath)"
The output is still broken. Why?
That is strange. Are you sure it works without the -p cal option? I mean, how did you test it with the template errors? Could you post the test case?
I just use a simple test in this case.
kernel void motion_estimation(unsigned char src[],
unsigned char ref[],
int width,
int height,
out double sad<>,
out int mvx<>,
out int mvy<>)
{
// Output position
int2 vPos = instance().xy;
int i = vPos.x; // width
int j = vPos.y; // height
int ix = i * 16;
int jy = j * 16;
sad = 2.0;
}
That's it. So, no templates. With the -p cal option the output sad contains garbage. With -p cpu everything is OK (the value is 2.0).
Something is going wrong. It seems you are running your code under the CPU backend. Make sure you close Visual Studio or the command prompt after changing the environment variable and then open it again so that it reads the updated value.
You are right. I tried restarting VS, but it did not help.
Restarting Windows helped.
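A process only sees the environment it was started with, so one way to check which value the application actually received is to read the variable from inside the host program. The following is a minimal C sketch; the helper name brt_runtime_is_cpu is hypothetical, and the exact, case-sensitive comparison with "cpu" is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if this process sees BRT_RUNTIME=cpu,
   0 if the variable is unset or holds any other value. */
int brt_runtime_is_cpu(void)
{
    const char *value = getenv("BRT_RUNTIME");
    return value != NULL && strcmp(value, "cpu") == 0;
}
```

Calling it once at start-up and printing a warning when it returns 1 shows immediately whether a stale setting is still in effect.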
if ((sad >= testsad) && (mvlength > abs(y) + abs(x)))
{
sad = testsad;
mvlength = abs(y) + abs(x);
mvy = y;
mvx = x;
}
ERROR--1: In Binary expression: Mismatched operands: both must have same type and same number of components
1> Statement: sad >= testsad && mvlength > abs(y) + abs(x) in sad >= testsad && mvlength > abs(y) + abs(x)
1> Expression : sad >= testsad, Type : double
1> Expression : mvlength > abs(y) + abs(x), Type : int
I tried this: if (((int)sad >= (int)testsad) && (mvlength > abs(y) + abs(x))), but this condition is never true.
brcc gives a conditional expression the same type as its operands. You can try this:
if ((int)(sad >= testsad) && (mvlength > abs(y) + abs(x)))
No, I think your variant is incorrect. I have checked it.
The correct one is:
if (((int)sad >= (int)testsad) && (mvlength > abs(y) + abs(x)))
I just figured it out.
But that converts sad and testsad to int before the condition is checked, which can produce incorrect results.
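The difference between the two variants can be sketched in plain C. Casting each operand truncates it before the comparison, while casting the result compares the doubles exactly. The function names are hypothetical, and the sketch shows only the arithmetic, not how brcc types the expression.

```c
/* Truncates both values to int and then compares them. */
int ge_truncated(double sad, double testsad)
{
    return (int)sad >= (int)testsad;
}

/* Compares the doubles directly; only the result is an int. */
int ge_exact(double sad, double testsad)
{
    return sad >= testsad;
}
```

For sad = 2.1 and testsad = 2.9 the truncated comparison is true (2 >= 2) while the exact one is false. When both values always hold whole numbers, as sums of byte differences do, the two variants agree.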
Hmm, I have spent many days trying to understand what is going on with my kernel code.
Perhaps you can help me. Are these two code chunks equivalent? I mean in their logic.
This chunk I execute on the CPU:
int xleft = 0, xright = 16;
int ytop = 0, ybottom = 16;
int temp = 0;
int mvlength = 100000000;
for (int j = 0; j < height; j += 16)
{
for (int i = 0; i < width; i += 16)
{
// set top and bottom range
ytop = - min(j, 16);
ybottom = min(height - 16 - j + 1, 16);
// set left and right range
xleft = - min(i, 16);
xright = min(width - 16 - i + 1, 16);
refsad[l] = 100000000;
for (int y = ytop; y < ybottom; y++)
{
for (int x = xleft; x < xright; x++)
{
int srcidx = i + (j * width);
int index = i + x + ((j + y) * width);
// calculate SAD
//--------------------------------
for (m = 0; m < 16; m++)
{
for (n = 0; n < 16; n++)
{
temp += abs((src[srcidx + n] - ref[index + n]));
}
srcidx += width;
index += width;
}
//-------------------------------
if ((refsad[l] >= temp) && (mvlength > abs(x) + abs(y)))
{
refsad[l] = temp;
mvlength = abs(x) + abs(y);
refmvx[l] = x;
refmvy[l] = y;
refmvl[l] = mvlength;
}
temp = 0;
}
}
l++;
mvlength = 100000000;
}
}
And this one runs as a kernel:
int ytop = - min(jy, 16);
int ybottom = min(height - 16 - jy + 1, 16);
// set left and right range
int xleft = - min(ix, 16);
int xright = min(width - 16 - ix + 1, 16);
int x, y;
int m, n;
int mvlength = 100000000;
sad = 100000000;
for (y = ytop; y < ybottom; y++)
{
for (x = xleft; x < xright; x++)
{
int testsad = 0;
int srcidx = ix + (jy * width);
int idx = ix + x + ((jy + y) * width);
for (m = 0; m < 16; m++)
{
for (n = 0; n < 16; n++)
{
testsad += (abs((((int)src[srcidx + n]) - ((int)ref[idx + n]))));
}
srcidx += width;
idx += width;
}
if ((sad >= testsad) && (mvlength > (abs(y) + abs(x))))
{
sad = testsad;
mvlength = (abs(y) + abs(x));
mvy = y;
mvx = x;
mvl = mvlength;
}
}
}
}
What do you think, is the logic the same? I get different results in mvx and mvy. Perhaps you can see a mistake in the kernel code, because I expect exactly the same behaviour.
I think the problem is in the last if statement:
if ((sad >= testsad) && (mvlength > (abs(y) + abs(x))))
If you need additional code let me know.
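For cross-checking the two versions, the per-block search can be written as one self-contained C function that follows the kernel logic quoted above. This is only a sketch: the names BlockMatch, block_sad and search_block are hypothetical, and the acceptance condition is copied from the thread as it stands, without judging whether it is the intended one.

```c
#include <stdlib.h>

/* Hypothetical result record for one 16x16 block. */
typedef struct { int sad, mvx, mvy, mvl; } BlockMatch;

static int min_int(int a, int b) { return a < b ? a : b; }

/* SAD between the 16x16 block at (bx, by) in src and the block
   displaced by (dx, dy) in ref; both images are width pixels wide. */
static int block_sad(const unsigned char *src, const unsigned char *ref,
                     int width, int bx, int by, int dx, int dy)
{
    int sum = 0;
    for (int m = 0; m < 16; m++) {
        for (int n = 0; n < 16; n++) {
            int s = src[(by + m) * width + bx + n];
            int r = ref[(by + dy + m) * width + bx + dx + n];
            sum += abs(s - r);  /* operands are already widened to int */
        }
    }
    return sum;
}

/* Searches the window around the block whose top-left pixel is (ix, jy). */
BlockMatch search_block(const unsigned char *src, const unsigned char *ref,
                        int width, int height, int ix, int jy)
{
    BlockMatch best = { 100000000, 0, 0, 100000000 };
    int ytop = -min_int(jy, 16);
    int ybottom = min_int(height - 16 - jy + 1, 16);
    int xleft = -min_int(ix, 16);
    int xright = min_int(width - 16 - ix + 1, 16);

    for (int y = ytop; y < ybottom; y++) {
        for (int x = xleft; x < xright; x++) {
            int testsad = block_sad(src, ref, width, ix, jy, x, y);
            int length = abs(y) + abs(x);
            /* Same condition as in the thread: the SAD must not be
               worse AND the vector must be strictly shorter. */
            if (best.sad >= testsad && best.mvl > length) {
                best.sad = testsad;
                best.mvl = length;
                best.mvx = x;
                best.mvy = y;
            }
        }
    }
    return best;
}
```

Running this for each block position (ix, jy) in steps of 16 and comparing the result with the mvx, mvy and sad streams from the GPU narrows down whether a mismatch comes from the logic or from the backend.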
I'm sure it is a driver problem again.
I have debugged in CPU mode (I kept forgetting about the debug mode), and there are no problems there.
The problems appear only in CAL mode.
Can I send the code by email? I am not comfortable publishing it on the forum.
Yes, you can send it to the email address mentioned in my profile. I will take a look as soon as I get some free cycles.
Done. Please let me know asap.
Thank you very much!
Gaurav,
Please confirm that you have received my email.
Yes, I have received your mail, but I couldn't find any issue with your code. It seems to be an issue on the driver side. Which Catalyst version are you using?
Could you try it with 9.2?
What is the output of my test? Are the results the same in CPU and CAL modes?
No, the results were different. CPU mode showed all values as 0, whereas CAL showed values of 15 (except the first column or first row, which was 0).
OK. So you agree that the problem exists at the driver level.
My Catalyst version is 2009.0428.2132.36839, driver version 8.612.0.0000.
It is the latest release for the x64 platform.
Before that I used an older version and got the same problems. Unfortunately, I don't remember its version number.
I have tried Catalyst 9.2. The problem is the same. What do you advise?
I am still waiting for your help. It is very important for me.
What is your official position?
Do you have plans to fix such bugs?