Archives Discussions

ryta1203 · ‎02-20-2009

Let's assume you are using 5 GPR/thread, so you can have 51 wavefronts active (256/5 = 51).

When a wavefront completes, does a new wavefront allocate resources and get put in the dispatcher? If so, how many wavefronts do you need beyond the number actually executing so that this is not noticeable in performance?

What I mean is, do wavefront batches (assuming a batch is how many can run in parallel, i.e. dispatcher + executing) execute serially, or is there some overlap (a new wavefront is created as soon as an old wavefront finishes)?
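The register arithmetic in this post (256/5 = 51) can be written out as a small helper. This is only a sketch: the 256-GPR budget is the figure quoted in the post, and the function name is hypothetical rather than part of any AMD or Brook+ API.

```cpp
// Register budget per thread slot, taken from the figure quoted in the post.
constexpr int kGprBudget = 256;

// Hypothetical helper: how many wavefronts fit if each thread
// needs `gprs_per_thread` general-purpose registers.
// Integer division rounds down, which is why 256 / 5 gives 51.
constexpr int max_wavefronts_by_gpr(int gprs_per_thread) {
    return gprs_per_thread > 0 ? kGprBudget / gprs_per_thread : 0;
}
```

Under this model, lowering register usage per thread is what raises the number of wavefronts that can be resident at once.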

MicahVillmow · ‎02-20-2009

Wavefronts are created as long as resources are available. If they cannot execute right away because other wavefronts are executing, they are put on the run queue and wait for the executing wavefronts to stall. Once an executing wavefront stalls, a wavefront from the run queue starts executing.

ryta1203 · ‎02-20-2009

Thanks.

So if you can have 1024 threads max on a SIMD, that means you can have 16 wavefronts max on a SIMD, yes?

So is there an advantage to having more than 16 wavefronts "running in parallel"?

Also, as a side question, what is the max 3D stream you can have? I've noticed that some 3D streams won't execute even if x*y*z < 8192x8192.

gaurav_garg · ‎02-21-2009

what is the max 3D stream you can have? I've noticed that some 3D streams won't execute even if x*y*z < 8192x8192.

If you are seeing any such behavior, that is a bug on the Brook+ side. Could you post the dimensions where you see these errors? Also, could you check for errors on your declared 3D streams: do you see errors during stream allocation or during some other operation? (It is possible that your card is out of memory.)

ryta1203 · ‎02-21-2009

In my main code (non-br file) I have an array "float4 trace[1001][1001][1]" and in my br file I have the same array as a stream "float4 trace_s<1001, 1001, 1>". This causes a crash (before you say it, I'd like to have a larger 3rd dimension but just used 1 as a test case).

I get a stack overflow error when the program begins. I'm using a 4850 512MB vid card.

gaurav_garg · ‎02-21-2009

Was float4 trace[1001][1001][1] created on the stack, or was it allocated dynamically?

A stack overflow error means you are allocating too much memory on the stack.
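To put numbers on this: at 16 bytes per float4, the array needs 1001 * 1001 * 1 * 16 = 16,032,016 bytes, roughly 15 MiB, which is far beyond a typical default stack of a few megabytes. A minimal sketch of a heap-backed replacement follows. The `float4` struct here is a stand-in that assumes four packed floats, and the helper names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the float4 type used in the thread
// (assumption: four packed 32-bit floats, 16 bytes total).
struct float4 { float x, y, z, w; };

constexpr std::size_t kDimX = 1001, kDimY = 1001, kDimZ = 1;

// Bytes the full array occupies.
constexpr std::size_t trace_bytes() {
    return kDimX * kDimY * kDimZ * sizeof(float4);
}

// Heap-backed replacement for `float4 trace[1001][1001][1]`.
// std::vector keeps its elements on the heap, so the stack is not touched.
std::vector<float4> make_trace() {
    return std::vector<float4>(kDimX * kDimY * kDimZ);
}

// Flattened row-major index for element [x][y][z].
constexpr std::size_t trace_index(std::size_t x, std::size_t y, std::size_t z) {
    return (x * kDimY + y) * kDimZ + z;
}
```

An element that was `trace[x][y][z]` becomes `trace[trace_index(x, y, z)]` on the vector. Declaring the original array `static` or at file scope would also move it off the stack.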

ryta1203 · ‎02-21-2009

gaurav,

You are correct, and I have fixed this. I was just mentioning something I had noticed.

However, my original questions still stand:

So if you can have 1024 threads max on a SIMD, that means you can have 16 wavefronts max on a SIMD, yes?

So is there an advantage to having more than 16 wavefronts "running in parallel"?

ryta1203 · ‎02-23-2009

Originally posted by: ryta1203

So if you can have 1024 threads max on a SIMD, that means you can have 16 wavefronts max on a SIMD, yes?
So is there an advantage to having more than 16 wavefronts "running in parallel"?

16 wavefronts is the max that each SIMD engine can run, since each can only run 1024 threads.

So what is the benefit to being able to have more than 16, in the sense that the resources (GPR) are available to do so, since a SIMD can't run more than this anyways?
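The two limits being compared in this post can be sketched side by side. The 1024-thread figure is taken at face value from the post, the wavefront size of 64 is the value it implies (1024 / 16), and the function names are hypothetical.

```cpp
// Limit from the thread count: max threads divided by threads per wavefront.
constexpr int wavefronts_by_threads(int max_threads, int wavefront_size) {
    return wavefront_size > 0 ? max_threads / wavefront_size : 0;
}

// Wavefronts that can actually be resident: the smaller of the
// thread-count limit and the register-file limit.
constexpr int resident_wavefronts(int by_threads, int by_gprs) {
    return by_threads < by_gprs ? by_threads : by_gprs;
}
```

With the numbers in the thread, `wavefronts_by_threads(1024, 64)` is 16, so a register limit of 51 would be clipped to 16 if the thread-count cap applies.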

MicahVillmow · ‎02-23-2009

Ryta,

16 is the max number of wavefronts available in compute shader mode with LDS usage. In pixel shader mode this depends greatly on the kernel resource usage.

The more wavefronts executing in parallel, the more latency hiding you have. By increasing the number of wavefronts executing in parallel, it is possible to take a memory bound kernel and make it computation bound.

ryta1203 · ‎02-23-2009

Micah,

Is the mode Brook+ runs in documented somewhere? I must have missed it, sorry.

Also, so in pixel shader mode you can have an unlimited number of wavefronts providing you have enough resources?

MicahVillmow · ‎02-24-2009

Brook+ currently runs in pixel shader mode. This can be seen from the first token of each IL stream, il_ps_2_0, whereas if it switched to compute shader mode the token would be il_cs_2_0.

Also, the driver can limit the number of resources in pixel shader mode because it is part of the graphics pipeline and must share resources with other shader modes. This constraint does not exist in compute shader mode.
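The first-token check described above could be automated when inspecting generated IL text. The sketch below only looks at the `il_ps_` and `il_cs_` prefixes mentioned in this post; the enum and function name are hypothetical and not part of any AMD tool.

```cpp
#include <cstddef>
#include <string>

enum class ShaderMode { Pixel, Compute, Unknown };

// Classify an IL listing by its first token, e.g. "il_ps_2_0" or "il_cs_2_0".
ShaderMode shader_mode_from_il(const std::string& il_text) {
    // Skip leading whitespace so the first token is found.
    const std::size_t start = il_text.find_first_not_of(" \t\r\n");
    if (start == std::string::npos) return ShaderMode::Unknown;

    // Compare the six characters at `start` against each known prefix.
    if (il_text.compare(start, 6, "il_ps_") == 0) return ShaderMode::Pixel;
    if (il_text.compare(start, 6, "il_cs_") == 0) return ShaderMode::Compute;
    return ShaderMode::Unknown;
}
```

Passing the text of a Brook+ generated kernel would be expected to yield `ShaderMode::Pixel`, given the statement above that Brook+ emits `il_ps_2_0`.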

Another Wavefront Question