
wellwill
Journeyman III

Shared memory usage causes blue screen

Hi, I have a question about using shared memory (LDS).

The sample code is attached. Running it causes a blue screen (if I turn off the recovery system, the machine hangs and I have to shut it down).

If I change the group size from 256 to 64, it will run successfully.

What is the problem in my sample code?

Thanks!

//lds.br
Attribute[GroupSize(256, 1, 1)]
kernel void ati_test(out int output[][])
{
    shared float4 lds[1024];
    int2 index = instance().xy;
    int i;
    for(i = 0; i < 30; i++)
    {
        lds[4 * instanceInGroup().x + 0] = float4(0.0f, 0.0f, 0.0f, 0.0f);
        syncGroup();
    }
    output[index.y][index.x] = 0;
}

// main.cpp
#include "brookgenfiles/lds.h"
int main()
{
    for(int i = 0; i < 30; i++)
    {
        unsigned int streamSize[] = {256, 4};
        brook::Stream<int> output(2, streamSize);
        int *o0 = new int[256 * 4];
        ati_test(output);
        output.write(o0);
        delete []o0;
    }
    return 0;
}

0 Likes
18 Replies
gaurav_garg
Adept I

Which OS and Catalyst version are you using?

0 Likes

Also which graphics card are you using?
0 Likes

Sorry for the late reply.

My system is:

OS: Vista 32-bit

VGA: ATI Radeon HD4850 

SDK: 1.4.0_beta

Catalyst: 09.7

driver: 8.632-090702a-084683C-ATI

CPU: Intel Core 2 Duo E8400 3.0GHz

RAM: 2.0 GB

 

0 Likes

Should I change my Catalyst version?

0 Likes
wellwill
Journeyman III

I find that the sample code works now. =.=||

But I did not change anything~!!!!

Maybe there was something dirty left on my GPU at that time.

I will try to reproduce the problem and figure out the root cause.

Sorry to bother you.

0 Likes

I reproduced the error with the attached code.

After adding a for loop to the kernel function, I get the blue screen~!

If I change the group size from 256 to 64, it runs successfully.

My program needs to access the LDS and call syncGroup() inside the for loop.

Is anything wrong in my sample code?

My environment is listed above.

//lds.br
Attribute[GroupSize(256, 1, 1)]
kernel void ati_test(out int output[][])
{
    shared float4 lds[1024];
    int2 index = instance().xy;
    int i;
    for(i = 0; i < 30; i++)
    {
        lds[4 * instanceInGroup().x + 0] = float4(0.0f, 0.0f, 0.0f, 0.0f);
        syncGroup();
    }
    output[index.y][index.x] = 0;
}

// main.cpp
#include "brookgenfiles/lds.h"
int main()
{
    for(int i = 0; i < 30; i++)
    {
        unsigned int streamSize[] = {256, 4};
        brook::Stream<int> output(2, streamSize);
        int *o0 = new int[256 * 4];
        ati_test(output);
        output.write(o0);
        delete []o0;
    }
    return 0;
}
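One thing that can be ruled out with plain arithmetic is an out-of-bounds LDS write. The sketch below is ordinary host-side C++, not Brook+ code; it only re-does the index calculation from the posted kernel. The helper names (`maxLdsIndex`, `ldsWritesInBounds`, `ldsBytes`) are made up for illustration, and the 16-byte size of `float4` is an assumption (four 32-bit floats).

```cpp
#include <cstddef>

// Values taken from the posted kernel.
constexpr std::size_t kLdsElements = 1024;  // shared float4 lds[1024]
constexpr std::size_t kFloat4Bytes = 16;    // assumption: float4 = four 32-bit floats

// Highest LDS element written by one group: the last thread has
// instanceInGroup().x == groupSize - 1 and writes lds[4 * x + 0].
constexpr std::size_t maxLdsIndex(std::size_t groupSize) {
    return 4 * (groupSize - 1);
}

// True when every thread of the group stays inside the declared array.
constexpr bool ldsWritesInBounds(std::size_t groupSize) {
    return groupSize > 0 && maxLdsIndex(groupSize) < kLdsElements;
}

// Bytes of LDS requested by the declaration, per group.
constexpr std::size_t ldsBytes() {
    return kLdsElements * kFloat4Bytes;
}
```

With a group size of 256 the highest index written is 1020, which is still inside `lds[1024]`, and the same holds for 64. So the index expression alone does not explain why only the larger group size fails.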

0 Likes

Your code works fine on my system. Ran it 3 times in a row, no problem.

XP32, Cat 9.7, 4850, 4GB DDR2 RAM, Q6600, X38

0 Likes

Thanks for the reply.

So you can call syncGroup() in the for loop with group size = 256.

It's so weird. I can only run it with group size = 64. >"<

 

0 Likes

It also works on my friend's computer! Orz

Maybe I should reinstall my driver.

0 Likes
wellwill
Journeyman III

Installing the latest driver did not help.

I swapped in another graphics card (3650), and then it works!

It looks like my 4850 graphics card is broken. =.=||

(I only bought it a month ago~~ >"<)

Sorry to everyone who spent time on my problem!

0 Likes

After checking with my friends,

the "3450, 3650, 4350" graphics cards work,

but the "4850, 4890" do not work with this sample code!!

 

PS. Running the ATI Brook sample code shows that scatter is not supported on the 3650 (I guess the 3450 does not support it either?)

0 Likes

wellwill,
no compute shader will work on the 3XXX series of graphics cards, so if the Brook+ kernel you pasted above is executing, it is running in pixel shader mode. For performance reasons, I would only use a group size of 64 in compute shader on the 4XXX series of cards.
0 Likes

I am not sure how it worked with the RV670 series of cards. Brook+ checks whether the underlying hardware supports compute shader; otherwise it marks the output stream with an error. Did you check the error on your output stream?

0 Likes

I only tested for the blue screen issue, so I did not check the output result.

It's my fault!

I just wonder why a group size larger than 64 does not work.

If a group size of 64 gives the best performance, I will try to develop my stream code with group size = 64.

0 Likes

Referring to http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=117364&enterthread=y, MicahVillmow said the best group size is 8x8 or 16x4, along with rearranging the block size to the same dimensions, as I mentioned.

Right now in Brook+ the 64x1x1 group size is useless. You can compare it yourself: it is much slower than PS mode, and there is no way at the moment to change the group size to 2D or 3D.

Only when Brook+ supports 2D or 3D group sizes will it reach its best performance.
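For illustration, the 8x8 and 16x4 shapes mentioned above can be expressed as index arithmetic over a 1D group of 64 threads. This is a plain C++ sketch of the mapping only, not a Brook+ feature and not a claim that it recovers the lost performance. `TileCoord`, `toTile` and `toGlobal` are hypothetical names, and the row-major ordering of groups is an assumption.

```cpp
// Coordinate of a thread, either inside its tile or in the whole domain.
struct TileCoord {
    int x;
    int y;
};

// Map a 1D in-group index (0..63 for a 64x1x1 group) to a position
// inside a tile that is tileWidth threads wide.
inline TileCoord toTile(int localId, int tileWidth) {
    return TileCoord{localId % tileWidth, localId / tileWidth};
}

// Map (group index, in-group index) to a position in the 2D domain,
// assuming groups are laid out row by row, tilesPerRow groups per row.
inline TileCoord toGlobal(int groupId, int localId,
                          int tileWidth, int tileHeight, int tilesPerRow) {
    const TileCoord local = toTile(localId, tileWidth);
    const int originX = (groupId % tilesPerRow) * tileWidth;
    const int originY = (groupId / tilesPerRow) * tileHeight;
    return TileCoord{originX + local.x, originY + local.y};
}
```

For example, `toTile(63, 8)` gives the corner (7, 7) of an 8x8 tile, while `toTile(63, 16)` gives (15, 3) in a 16x4 tile.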

0 Likes

Even CAL won't allow 2D or 3D work group sizes. You might see this slow-down because Brook+ has to copy between tiled and linear streams in both directions. One good way to avoid this redundant copy is to use a 1D scatter stream (size < 8192) in your kernel.

0 Likes
wellwill
Journeyman III

My output stream may be larger than 8192, so I should use a 2D stream, like 256 * X (where X is 1 ~ 1000).

As far as I know, Brook+ uses PS mode by default,

so if PS mode can reach better performance,

does that mean I don't need to use CS to improve the performance?
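The layout decision discussed here can be sketched in plain host-side C++. The 8192 limit and the 256-wide rows come from the posts above; `StreamLayout` and `chooseLayout` are hypothetical names, and putting the width first in `dims` is an assumption based on the `{256, 4}` stream in the sample code. The result would still have to be passed to `brook::Stream` by hand.

```cpp
#include <cstddef>

// Shape to request for an output stream.
struct StreamLayout {
    unsigned int rank;     // 1 for a 1D stream, 2 for a 2D stream
    unsigned int dims[2];  // assumption: {width, height}, as in {256, 4}
};

constexpr std::size_t kMax1DSize = 8192;  // "size < 8192" from the thread
constexpr unsigned int kRowWidth = 256;   // row width used in the posts

// Use a 1D stream while the element count is under the quoted limit,
// otherwise fall back to 256-wide rows, rounding the row count up.
inline StreamLayout chooseLayout(std::size_t elementCount) {
    if (elementCount < kMax1DSize) {
        return StreamLayout{1, {static_cast<unsigned int>(elementCount), 1}};
    }
    const std::size_t rows = (elementCount + kRowWidth - 1) / kRowWidth;
    return StreamLayout{2, {kRowWidth, static_cast<unsigned int>(rows)}};
}
```

When the element count is not a multiple of 256, the last row is only partly used, so the kernel or the host code has to ignore the padding elements.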

 

0 Likes

Originally posted by: wellwill My output stream may be larger than 8192, so I should use a 2D stream, like 256 * X (where X is 1 ~ 1000).

As far as I know, Brook+ uses PS mode by default,

so if PS mode can reach better performance,

does that mean I don't need to use CS to improve the performance?

 

You need CS to reach peak performance by tuning CAL/IL: adjusting the memory access pattern, adjusting the block size, improving cache hits, etc.

Are you ready to learn CAL/IL? It feels more like assembly than a framework.

PS is fine, but it reaches much lower efficiency. Even so, given the short development time, I think it can outperform the best possible multi-threaded SSE CPU implementation for huge domain sizes, though not by much.

Try a Brook+ example and compare it with an OpenCL CPU example. The code will speak for itself about how urgently we need CAL to outperform the CPU. AFAIK the i7 now has a 100 GFLOPS peak, and if that can be reached via OpenCL, then CAL is a must to outperform the CPU.

0 Likes