Hi,
I have run into a small problem. My kernel function does not work correctly and returns only a partly correct result. The function is very simple, and I believe the mistake is not mine. Could you please look at my example? What's wrong?
kernel void func1(unsigned char src[][], unsigned char str, out unsigned char o_img<>)
{
// Output position
int j = instance().x; // width
int i = instance().y; // height
int rest = j % 16;
if (rest == 0)
{
o_img = src[i][j];
}
else
{
o_img = str;
}
}
Input:
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1
I simply replace every 1 with 9.
Wrong output:
3 9 9 9 9 9 9 9 9 9 9 9 9 9 9 3 9 9 9 9 9 9 9 9 9 9 9 9 9 9 3 9 9 9 9 9 9 9
You can see garbage in memory. If I remove the if statement from the kernel, the output contains no garbage. But I need the first version.
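For comparing against the GPU output, the expected result can be computed on the host. This is only a minimal sketch: the function name func1_reference is made up for illustration, and it assumes a row-major buffer of width * height bytes, as in the host code posted in this thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for func1 (not part of Brook+):
// keep every 16th column of the source, overwrite all other columns with `fill`.
// Assumes a row-major buffer holding width * height bytes.
std::vector<unsigned char> func1_reference(const std::vector<unsigned char>& src,
                                           std::size_t width, std::size_t height,
                                           unsigned char fill)
{
    std::vector<unsigned char> out(width * height);
    for (std::size_t i = 0; i < height; ++i) {        // row
        for (std::size_t j = 0; j < width; ++j) {     // column
            const std::size_t idx = j + i * width;
            out[idx] = (j % 16 == 0) ? src[idx] : fill;
        }
    }
    return out;
}
```

Any position where the buffer written back from the output stream differs from this reference is a candidate for the garbage described above.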
I could not reproduce this issue on my system. Could you send me your system configuration?
Yes,
ATI Radeon 4800 HD, driver 8.600.0.0
Intel Core 2 6300 1.86 GHz, 1 GB DDR2
Chunk of code just in case:
// Specifying the size of the 2D stream
unsigned int streamSize[] = {width, height};
// Specifying the rank of the stream
unsigned int rank = 2;
brook::Stream<unsigned char> inputStream(rank, streamSize);
// Copying data from input buffer to input stream
inputStream.read(src);
//--------------------------------------------------------------------------
// Creating the output stream
//--------------------------------------------------------------------------
streamSize[0] = width;
streamSize[1] = height;
brook::Stream<unsigned char> outputStream(rank, streamSize);
//--------------------------------------------------------------------------
// Executing kernel and copying back data
//--------------------------------------------------------------------------
unsigned char str = '9';
// Calling the kernel on the input and output streams
func1(inputStream, str, outputStream);
// Creating an output buffer
unsigned char* ref = new unsigned char[width * height];
memset(ref, 0, width * height);
// Copying data from output stream to output buffer
outputStream.write(ref);
Any ideas?
No idea; some of these errors are just too strange.
OK. I will try reinstalling the driver or the Stream SDK.
I am also using the 8.60 driver, but I don't see the issue.
What is your OS, and what are the values of width and height?
I have tried Windows XP 32-bit and Windows Vista 64-bit SP1, with the same result. I am using the latest Stream SDK and driver.
I have tried different widths and heights. Full code:
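#include <cstdio>    // sprintf, fprintf
#include <cstring>   // memset
#include <iostream>
#include <windows.h> // OutputDebugString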
#include "brook/Stream.h"
#include "brook/KernelInterface.h"
#include "brookgenfiles/kernel.h"
void print(unsigned char* arr, int width, int height)
{
for (int i = 0; i < height; i++)
{
for(int j = 0; j < width; j++)
{
char cStr[256];
sprintf(cStr, "% 3c ", arr[j + i * width]);
OutputDebugString(cStr);
}
OutputDebugString("\n");
}
}
int
main(int argc, char* argv[])
{
// Specifying the width and height of the 2D buffer
const unsigned int width = 49;
const unsigned int height = 6;
//--------------------------------------------------------------------------
// Creating and initializing the input buffer
//--------------------------------------------------------------------------
// Creating an input buffer
unsigned char* src = new unsigned char[width * height];
//memset(src, 7, width * height);
for (int i = 0; i < height; i++)
{
for(int j = 0; j < width; j++)
{
if (j % 16 == 0)
{
src[j + i * width] = '3';
}
else
{
src[j + i * width] = '1';
}
}
OutputDebugString("\n");
}
print(src, width, height);
// Initializing the input buffer such that
// input(i,j) = i*width + j
// fillBuffer(inputBuffer, width, height);
// Printing input buffer
fprintf(stdout, "Input buffer:\n");
//--------------------------------------------------------------------------
// Creating the input stream and copying data from input buffer
//--------------------------------------------------------------------------
// Specifying the size of the 2D stream
unsigned int streamSize[] = {width, height};
// Specifying the rank of the stream
unsigned int rank = 2;
// Create a 2D stream of the specified size (width x height unsigned char values)
brook::Stream<unsigned char> inputStream(rank, streamSize);
// Copying data from input buffer to input stream
inputStream.read(src);
//--------------------------------------------------------------------------
// Creating the output stream
//--------------------------------------------------------------------------
streamSize[0] = width;
streamSize[1] = height;
brook::Stream<unsigned char> outputStream(rank, streamSize);
//--------------------------------------------------------------------------
// Executing kernel and copying back data
//--------------------------------------------------------------------------
unsigned char str = '9';
// Calling the kernel on the input and output streams
func1(inputStream, str, outputStream);
// Creating an output buffer
unsigned char* ref = new unsigned char[width * height];
memset(ref, 0, width * height);
//memset(ref, 0, width * height * sizeof(just));
//print(ref, width, height);
// Copying data from output stream to output buffer
outputStream.write(ref);
print(ref, width, height);
// Check error on stream
if(outputStream.error())
{
// Print error Log associated to stream
fprintf(stdout, "%s\n", outputStream.errorLog());
}
fprintf(stdout, "Output buffer:\n");
// printBuffer(outputBuffer, width, 0, 0, 8, 8);
//--------------------------------------------------------------------------
// Checking whether the result is correct or not
//--------------------------------------------------------------------------
//--------------------------------------------------------------------------
// Cleaning up
//--------------------------------------------------------------------------
delete[] src;
delete[] ref;
return 0;
}
I have just noticed one thing. Depending on the width and height in streamSize for the output Brook stream, I get different results. I mean the garbage appears in different locations.
I suppose something is wrong in my kernel function.
If I set
streamSize[0] = width;
streamSize[1] = 1;
brook::Stream<unsigned char> outputStream(rank, streamSize);
The result is correct. As soon as I set height > 1, the problem occurs.
Hmm. I obtained the correct result. The changes:
kernel void func1(unsigned char src[], unsigned char str, unsigned char str2, out unsigned char o_img<>)
src is now a one-dimensional array.
And I set the src rank to 1 and the dst rank to 2.
Please comment on this. What was the reason for the problem? Is it my allocation approach?
unsigned char* src = new unsigned char[width * height];
Please respond.
Is the dst height 1?
Are you able to run samples\legacy\tests\sum?
Let me know whether the sum sample runs or not.
Yes, I am able to. No problem there. Now the height and width can be anything, and my example works as I expected. I think I have misunderstood the concept, and I would ask you to explain what I have got wrong.
What changes did you make to your code?
I did not see any problems with the code pasted at the top.
The changes were:
1. I set stream rank 1 for the src stream instead of 2.
unsigned int rank = 1;
brook::Stream<unsigned char> inputStream(rank, streamSize);
inputStream.read(src);
2. I changed my kernel function accordingly. You can see a one-dimensional src array [] instead of [][] in the previous version.
kernel void func1(unsigned char src[], unsigned char str, out unsigned char o_img<>)
{
// Output position
int2 vPos = instance().xy;
int j = vPos.x; // width
int i = vPos.y; // height
int rest = j % 16;
if (rest > 0)
{
o_img = str;
}
else
{
o_img = src[j + i * 40];
}
}
That's all.
The constant 40 in the code above is the width.
In kernel code, the dimension of the output is important.
You can also use src[][], but in that case both the size and the dimensions of src and dst must be the same.
I did not change the output properties, only the input.
And your last sentence describes my first approach, when I obtained incorrect results (garbage in memory).
So the question is still open.
With the given width and height, I could reproduce this. A quick workaround is to use a regular stream instead of a gather stream:
kernel void func1(unsigned char src<>, unsigned char str, out unsigned char o_img<> )
{
// Output position
int j = instance().x; // width
int i = instance().y; // height
int rest = j % 16;
if (rest == 0)
{
o_img = src;
}
else
{
o_img = str;
}
}
It is now confirmed that this is a regression in Catalyst 9.4. You can try a previous version of Catalyst.
So, in other words, it is a driver problem. Is that right?
Yes.
OK. Thank you very much for your support.
Today I ran into another problem. I expected different behaviour.
Kernel:
kernel void motion_estimation(unsigned char src[],
unsigned char ref[],
unsigned char str,
unsigned char str2,
out unsigned char o_img<>,
out double sad<>)
{
// Output position
int2 vPos = instance().xy;
int i = vPos.x; // width
int j = vPos.y; // height
if ((j % 4 > 0 || i % 4 > 0) || (j == 12 || i == 32))
{
o_img = str;
sad = 1.0;
}
else
{
estimate_macroblock_4x4(src, ref, i, j, str2, o_img, sad);
}
}
kernel int estimate_macroblock_4x4(unsigned char mbs[],
unsigned char mbr[],
int i, int j,
unsigned char str,
out unsigned char o_img<>,
out double sad<>)
{
int x, y;
//sad = (double) (i + 0 + ((j + 0) * 33)) ;
for (x = 0; x < 4; x++)
{
for (y = 0; y < 4; y++)
{
// PROBLEM IS HERE
int index = i + x + ((j + y) * 33);
sad += (double)(mbs[index] - mbr[index]);
}
}
return 0;
}
Part of the sad output is:
-71.000000 1.000000 1.000000 1.000000 -71.000000 1.000000 ...
1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ...
1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ...
1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ...
-80.000000 1.000000 1.000000 1.000000 -80.000000 1.000000 ...
1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ...
1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ...
1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ...
-71 is the correct value. It is the sum of differences between the blocks
* 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
and
/ 3 3 3
3 3 3 3
3 3 3 3
3 3 3 3
-80.0 is the difference between * and / only. I expect -71 everywhere instead of -80. It seems as if the nested for loops do not work for j > 0.
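To cross-check the GPU figures, the same 4x4 sum can be computed on the host. The following is only a sketch under stated assumptions: the name block_diff_sum_4x4 is hypothetical, the layout is assumed to be row-major, and the accumulator starts at 0.0. The kernel above accumulates into sad with += without first assigning it a value, so the zero start is an assumption about the intent.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side check (not part of Brook+):
// sum of signed differences a - b over the 4x4 block whose top-left corner
// is at column i, row j. Row-major layout with the given width is assumed.
double block_diff_sum_4x4(const std::vector<unsigned char>& a,
                          const std::vector<unsigned char>& b,
                          std::size_t width, std::size_t i, std::size_t j)
{
    double sum = 0.0;  // explicit start value (assumed intent)
    for (std::size_t y = 0; y < 4; ++y) {
        for (std::size_t x = 0; x < 4; ++x) {
            const std::size_t index = i + x + (j + y) * width;
            // at() throws std::out_of_range if the block leaves the buffer
            sum += static_cast<double>(a.at(index)) - static_cast<double>(b.at(index));
        }
    }
    return sum;
}
```

Because at() is bounds-checked, running this on the same buffers also reveals whether any block reads past the end of the data.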
I would recommend that you first try Catalyst 9.2 and see whether your problems are resolved.
The same problem on 9.2, only with different figures. The same description applies:
4025.000000 1.000000 1.000000 1.0000
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000 1.000000 1.000000
4016.000000 1.000000 1.000000 1.0000
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000 1.000000 1.000000
If you are using 2D streams, you must use [][] for gather streams.
What do you mean? Where should I use [][]? Instead of sad<>?
kernel void motion_estimation(unsigned char src[],
unsigned char ref[],
unsigned char str,
unsigned char str2,
out unsigned char o_img<>,
out double sad<>)
{
// Output position
int2 vPos = instance().xy;
int i = vPos.x; // width
int j = vPos.y; // height
If your input streams src and ref are 2D streams, use [][]; otherwise it is fine.
My input streams are 1D. The outputs are 2D.
Then it is fine. Could you post your runtime code as well?
Do you mean the generated .cpp code?
////////////////////////////////////////////
// Generated by BRCC 1.4
// BRCC Compiled on: Mar 2 2009 13:07:15
////////////////////////////////////////////
#include "brook/brook.h"
#include "kernel_gpu.h"
#include "kernel.h"
static __BrtInt1 __estimate_macroblock_4x4_cpu_inner(const __BrtArray<__BrtUChar1 > &mbs,
const __BrtArray<__BrtUChar1 > &mbr,
const __BrtInt1 &i,
const __BrtInt1 &j,
const __BrtUChar1 &str,
__BrtUChar1 &o_img,
__BrtDouble1 &sad)
{
__BrtInt1 y, x;
for (y = __BrtInt1((int)0); y < __BrtInt1((int)4); y++)
{
for (x = __BrtInt1((int)0); x < __BrtInt1((int)4); x++)
{
__BrtInt1 index = i + x + (j + y) * __BrtInt1((int)33);
sad += (__BrtDouble1 ) (mbs[index] - mbr[index]);
}
}
return __BrtInt1((int)0);
}
void __estimate_macroblock_4x4_cpu(::brt::KernelC *__k, int __brt_idxstart, int __brt_idxend, bool __brt_isreduce)
{
__BrtArray<__BrtUChar1 > *arg_mbs = (__BrtArray<__BrtUChar1 > *) __k->getVectorElement(0);
__BrtArray<__BrtUChar1 > *arg_mbr = (__BrtArray<__BrtUChar1 > *) __k->getVectorElement(1);
__BrtInt1 *arg_i = (__BrtInt1 *) __k->getVectorElement(2);
__BrtInt1 *arg_j = (__BrtInt1 *) __k->getVectorElement(3);
__BrtUChar1 *arg_str = (__BrtUChar1 *) __k->getVectorElement(4);
::brt::StreamInterface *arg_o_img = (::brt::StreamInterface *) __k->getVectorElement(5);
::brt::StreamInterface *arg_sad = (::brt::StreamInterface *) __k->getVectorElement(6);
for(int __brt_idx=__brt_idxstart; __brt_idx<__brt_idxend; __brt_idx++) {
if(!(__k->isValidAddress(__brt_idx))){ continue; }
Addressable <__BrtUChar1 > __out_arg_o_img((__BrtUChar1 *) __k->FetchElem(arg_o_img, __brt_idx));
Addressable <__BrtDouble1 > __out_arg_sad((__BrtDouble1 *) __k->FetchElem(arg_sad, __brt_idx));
__estimate_macroblock_4x4_cpu_inner (
*arg_mbs,
*arg_mbr,
*arg_i,
*arg_j,
*arg_str,
__out_arg_o_img,
__out_arg_sad);
*reinterpret_cast<__BrtUChar1 *>(__out_arg_o_img.address) = __out_arg_o_img.castToArg(*reinterpret_cast<__BrtUChar1 *>(__out_arg_o_img.address));
*reinterpret_cast<__BrtDouble1 *>(__out_arg_sad.address) = __out_arg_sad.castToArg(*reinterpret_cast<__BrtDouble1 *>(__out_arg_sad.address));
}
}
void __motion_estimation_cpu_inner(const __BrtArray<__BrtUChar1 > &src,
const __BrtArray<__BrtUChar1 > &ref,
const __BrtUChar1 &str,
const __BrtUChar1 &str2,
__BrtUChar1 &o_img,
__BrtDouble1 &sad)
{
__BrtInt2 vPos = (indexof(o_img)).swizzle2(::brt::maskX, ::brt::maskY);
__BrtInt1 i = vPos.swizzle1(::brt::maskX);
__BrtInt1 j = vPos.swizzle1(::brt::maskY);
if (j % __BrtInt1((int)4) > __BrtInt1((int)0) || i % __BrtInt1((int)4) > __BrtInt1((int)0) || (j == __BrtInt1((int)12) || i == __BrtInt1((int)32)))
{
o_img = str;
sad = __BrtDouble1((double)1.0);
}
else
{
o_img = src[i + j * __BrtInt1((int)33)];
__estimate_macroblock_4x4_cpu_inner(src, ref, i, j, str2, o_img, sad);
}
}
void __motion_estimation_cpu(::brt::KernelC *__k, int __brt_idxstart, int __brt_idxend, bool __brt_isreduce)
{
__BrtArray<__BrtUChar1 > *arg_src = (__BrtArray<__BrtUChar1 > *) __k->getVectorElement(0);
__BrtArray<__BrtUChar1 > *arg_ref = (__BrtArray<__BrtUChar1 > *) __k->getVectorElement(1);
__BrtUChar1 *arg_str = (__BrtUChar1 *) __k->getVectorElement(2);
__BrtUChar1 *arg_str2 = (__BrtUChar1 *) __k->getVectorElement(3);
::brt::StreamInterface *arg_o_img = (::brt::StreamInterface *) __k->getVectorElement(4);
::brt::StreamInterface *arg_sad = (::brt::StreamInterface *) __k->getVectorElement(5);
for(int __brt_idx=__brt_idxstart; __brt_idx<__brt_idxend; __brt_idx++) {
if(!(__k->isValidAddress(__brt_idx))){ continue; }
Addressable <__BrtUChar1 > __out_arg_o_img((__BrtUChar1 *) __k->FetchElem(arg_o_img, __brt_idx));
Addressable <__BrtDouble1 > __out_arg_sad((__BrtDouble1 *) __k->FetchElem(arg_sad, __brt_idx));
__motion_estimation_cpu_inner (
*arg_src,
*arg_ref,
*arg_str,
*arg_str2,
__out_arg_o_img,
__out_arg_sad);
*reinterpret_cast<__BrtUChar1 *>(__out_arg_o_img.address) = __out_arg_o_img.castToArg(*reinterpret_cast<__BrtUChar1 *>(__out_arg_o_img.address));
*reinterpret_cast<__BrtDouble1 *>(__out_arg_sad.address) = __out_arg_sad.castToArg(*reinterpret_cast<__BrtDouble1 *>(__out_arg_sad.address));
}
}
void __motion_estimation::operator()(const ::brook::Stream< uchar >& src, const ::brook::Stream< uchar >& ref,
const uchar str,
const uchar str2,
const ::brook::Stream< uchar >& o_img,
const ::brook::Stream< double >& sad)
{
static const void *__motion_estimation_fp[] = {
"cal", __motion_estimation_cal,
"cpu", (void *) __motion_estimation_cpu,
NULL, NULL };
::brook::Kernel __k(__motion_estimation_fp, brook::KERNEL_MAP);
::brook::ArgumentInfo __argumentInfo;
__k.PushGatherStream(src);
__k.PushGatherStream(ref);
brook::Constant<uchar > constant_2(str);
__k.PushConstant(constant_2);
brook::Constant<uchar > constant_3(str2);
__k.PushConstant(constant_3);
__k.PushOutput(o_img);
__k.PushOutput(sad);
__argumentInfo.startExecDomain = _domainOffset;
__argumentInfo.domainDimension = _domainSize;
__k.run(&__argumentInfo);
DESTROYPARAM();
}
__THREAD__ __motion_estimation motion_estimation;
I mean the code where you declare the streams, call the kernel, and call the various operators on the streams.
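#include <cstdio>    // sprintf, fprintf
#include <cstring>   // memset
#include <iostream>
#include <windows.h> // OutputDebugString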
#include "brook/Stream.h"
#include "brook/KernelInterface.h"
#include "brookgenfiles/kernel.h"
void print(unsigned char* arr, int width, int height)
{
for (int i = 0; i < height; i++)
{
for(int j = 0; j < width; j++)
{
char cStr[256];
sprintf(cStr, "% 3c ", arr[j + i * width]);
OutputDebugString(cStr);
}
OutputDebugString("\n");
}
OutputDebugString("\n\n");
}
void printd(double* arr, int width, int height)
{
for (int i = 0; i < height; i++)
{
for(int j = 0; j < width; j++)
{
char cStr[256];
sprintf(cStr, "% 3f ", arr[j + i * width]);
OutputDebugString(cStr);
}
OutputDebugString("\n");
}
OutputDebugString("\n\n");
}
int
main(int argc, char* argv[])
{
// Specifying the width and height of the 2D buffer
const unsigned int width = 33;
const unsigned int height = 13;
//--------------------------------------------------------------------------
// Creating and initializing the input buffer
//--------------------------------------------------------------------------
// Creating an input buffer
unsigned char* src = new unsigned char[width * height];
unsigned char* ref = new unsigned char[width * height];
//memset(src, 7, width * height);
for (int i = 0; i < height; i++)
{
for(int j = 0; j < width; j++)
{
if (j % 4 == 0 && i % 4 == 0)
{
src[j + i * width] = '*';
ref[j + i * width] = '/';
}
else
{
src[j + i * width] = '1';
ref[j + i * width] = '3';
}
}
}
print(src, width, height);
print(ref, width, height);
// specifying the size of the 2D stream
unsigned int streamSize[] = {width, height};
// specifying the rank of the stream
unsigned int rank = 1;
brook::Stream<unsigned char> srcStream(rank, streamSize);
brook::Stream<unsigned char> refStream(rank, streamSize);
// copying data from input buffer to input stream
srcStream.read(src);
refStream.read(ref);
// creating the output stream
streamSize[0] = width;
streamSize[1] = height;
rank = 2;
brook::Stream<unsigned char> outputStream(rank, streamSize);
// creating the output stream
streamSize[0] = width;
streamSize[1] = height;
rank = 2;
brook::Stream<double> sad(rank, streamSize);
//--------------------------------------------------------------------------
// Executing kernel and copying back data
//--------------------------------------------------------------------------
unsigned char str = '9';
unsigned char str2 = '+';
double ddd = src[0] - ref[0];
double sadd = 0;
for (int y = 0; y < 4; y++)
{
for (int x = 0; x < 4; x++)
{
int index = 0 + x + ((0 + y) * 33);
sadd += (double)(src[index] - ref[index]);
}
}
// Calling the kernel on the input and output streams
motion_estimation(srcStream, refStream, str, str2, outputStream, sad);
// Creating an output buffer
unsigned char* out = new unsigned char[width * height];
memset(out, 0, width * height);
double *das = new double[width * height];
memset(das, 0, width * height * sizeof(double));
// Copying data from output stream to output buffer
outputStream.write(out);
sad.write(das);
print(out, width, height);
printd(das, width, height);
// Check error on stream
if(outputStream.error())
{
// Print error Log associated to stream
fprintf(stdout, "%s\n", outputStream.errorLog());
}
fprintf(stdout, "Output buffer:\n");
// printBuffer(outputBuffer, width, 0, 0, 8, 8);
//--------------------------------------------------------------------------
// Checking whether the result is correct or not
//--------------------------------------------------------------------------
//--------------------------------------------------------------------------
// Cleaning up
//--------------------------------------------------------------------------
delete[] src;
delete[] ref;
return 0;
}
kernel int estimate_macroblock_4x4(unsigned char mbs[],
unsigned char mbr[],
int i, int j,
unsigned char str,
out unsigned char o_img<>,
out double sad<>)
{
//o_img = str;
int x, y;
//sad = (double) (i + 0 + ((j + 0) * 33)) ;
for (y = 0; y < 4; y++)
{
for (x = 0; x < 4; x++)
{
int index = i + x + ((j + y) * 33);
sad += (double)(mbs[index] - mbr[index]);
}
}
//o_img = str;
return 0;
}
kernel void motion_estimation(unsigned char src[],
unsigned char ref[],
unsigned char str,
unsigned char str2,
out unsigned char o_img<>,
out double sad<>)
{
// Output position
int2 vPos = instance().xy;
int i = vPos.x; // width
int j = vPos.y; // height
if ((j % 4 > 0 || i % 4 > 0) || (j == 12 || i == 32))
{
o_img = str;
sad = 1.0;
}
else
{
o_img = src[i + j * 33];
estimate_macroblock_4x4(src, ref, i, j, str2, o_img, sad);
}
}
One thing that is definitely wrong with your test case is out-of-range indexing of the 1D input streams. Your input stream is 1D with size = width, not width * height:
// specifying the size of the 2D stream
unsigned int streamSize[] = {width, height};
// specifying the rank of the stream
unsigned int rank = 1;
brook::Stream<unsigned char> srcStream(rank, streamSize);
brook::Stream<unsigned char> refStream(rank, streamSize);
I think you want to do the following:
// specifying the size of the 1D stream
unsigned int streamSize[] = {width * height};
// specifying the rank of the stream
unsigned int rank = 1;
brook::Stream<unsigned char> srcStream(rank, streamSize);
brook::Stream<unsigned char> refStream(rank, streamSize);
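The effect of the stream size can be checked with plain index arithmetic. This is a minimal sketch with hypothetical helper names (flat_size, max_block_index, block_in_range), assuming the row-major indexing i + x + (j + y) * width used in the kernel:

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only (not part of Brook+).

// Number of elements a 1D stream needs to hold a width x height image.
std::size_t flat_size(std::size_t width, std::size_t height)
{
    return width * height;
}

// Largest flat index read by a 4x4 block at column i, row j (row-major).
std::size_t max_block_index(std::size_t width, std::size_t i, std::size_t j)
{
    return (i + 3) + (j + 3) * width;
}

// True when every read of the block stays inside a stream of `size` elements.
bool block_in_range(std::size_t size, std::size_t width, std::size_t i, std::size_t j)
{
    return max_block_index(width, i, j) < size;
}
```

With width = 33 and height = 13, the block at (0, 0) already reads up to index 102, which is outside a stream of 33 elements but inside one of 33 * 13 = 429 elements.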
Yes, you are right. That helps. Thank you.
Hello again! I have decided to continue this topic with my next question.
This line gives a compile error: if (testsad <= sad
error C2676: binary '<=' : '__BrtDouble1' does not define this operator or a conversion to a type acceptable to the predefined operator
kernel void motion_estimation(unsigned char src[],
unsigned char ref[],
int width,
int height,
out double sad[][])
What is the problem here? Can I not read elements from the output array?
It seems sad is a 2D scatter stream; shouldn't you index it with 2D indices?
No. Of course I use [] [], it is a forum problem. The second pair of brackets was removed for some unknown reason. Putting a space after '[' helps.
OK. Any other ideas?
Could you post the data type of testsad? These are template errors from the CPU runtime, and they don't show up with all versions of gcc.
I would suggest disabling CPU backend code generation to resolve these issues. You can compile the .br file with the -p cal option to disable CPU codegen.