Archives Discussions

Gerry · ‎02-16-2009

Will the new SDK 1.4 be compatible with 4850x2?

Hi all,

1. When will the SDK v1.4 be released?

2. Will it be compatible with 4850x2?

3. Will it be compatible with 2x (4850x2) in a system?

Thanks so much for sharing your thoughts.

Gerry

udeepta · ‎02-17-2009

1. In March.

2. We plan to test against 4870X2; I think it is safe to assume 4850X2 will also work.

3. Should work.

Please note that all the above are forward looking statements from me and not guarantees from AMD.

ryta1203 · ‎02-17-2009

udeepta,

Will local arrays be supported in Brook+?

udeepta · ‎02-17-2009

Unfortunately, no. This feature has been at the highest priority level since the 1.3 dev cycle, but we encountered some issues that prevented its completion within our timeline for 1.4.

I have gone through the pain of working around this by hard-coding each element of the local array (e.g., a0, a1, ..., a19 instead of a[20]). For now, that may act as a stone-age workaround for local arrays.

For example, I had to sort 25 values in my kernel.

float buf[25];
int k, l;
for (k = 0; k < 24; k++)
    for (l = k + 1; l < 25; l++)
        swap_if_less(buf[k], buf[l]);

Instead, I unrolled the loop:

float buf00, buf01, ... , buf24;

swap_if_less(buf00, buf01);
swap_if_less(buf00, buf02);
...
...
swap_if_less(buf23, buf24);

Very painful, but it did the job. And faster than a for-loop would have been.
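A minimal sketch of the same workaround in plain C (not Brook+), shrunk to four values so the whole unrolled sequence fits on screen. SWAP_IF_LESS and sort4_unrolled are hypothetical names; the macro assumes that swap_if_less exchanges its two arguments when the second is smaller than the first, which is what makes the k/l double loop sort in ascending order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for swap_if_less: after the call, a <= b. */
#define SWAP_IF_LESS(a, b) \
    do { if ((b) < (a)) { float t_ = (a); (a) = (b); (b) = t_; } } while (0)

/* Sorts four values in ascending order using named scalars only,
   i.e. the fully unrolled form of the k/l double loop. */
void sort4_unrolled(float v[4])
{
    assert(v != NULL);
    float b0 = v[0], b1 = v[1], b2 = v[2], b3 = v[3];

    SWAP_IF_LESS(b0, b1);  /* k = 0: b0 ends up as the minimum */
    SWAP_IF_LESS(b0, b2);
    SWAP_IF_LESS(b0, b3);
    SWAP_IF_LESS(b1, b2);  /* k = 1: b1 ends up as the second smallest */
    SWAP_IF_LESS(b1, b3);
    SWAP_IF_LESS(b2, b3);  /* k = 2: orders the last two */

    v[0] = b0; v[1] = b1; v[2] = b2; v[3] = b3;
}
```

The array parameter is only there to pass values in and out on the host side; inside the function every element lives in its own scalar, as in the kernel workaround.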

ryta1203 · ‎02-18-2009

udeepta,

I am VERY VERY VERY sorry to hear that. This really limits Brook+ IMO and makes coding for it a big pain for a lot of applications. I have been waiting for this for a while, since I thought it would be in 1.4. OpenCL might be out by the time this gets into Brook+, so why would anyone bother with Brook+ when OpenCL will undoubtedly have this capability?

I actually have some code I want to port to the GPU that needs to sort 129 items. That would take me quite a while to code, and be much more of a pain than your 25-value example.

I'm not even considering doing this in CAL, the code would be a big pain.
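If the unrolled workaround is the only option, the statements for 129 items do not have to be typed by hand. The following is a rough C sketch of a generator, assuming the scalars are named buf000 through buf128 and that a swap_if_less helper exists in the kernel; both function names here are hypothetical, and the scalar declarations would have to be generated the same way.

```c
#include <assert.h>
#include <stdio.h>

/* Number of compare-and-swap statements the unrolled exchange sort needs. */
int unrolled_pair_count(int n)
{
    assert(n >= 0);
    return n * (n - 1) / 2;
}

/* Writes one "swap_if_less(bufKKK, bufLLL);" line per pair k < l to out
   and returns the number of lines written. */
int emit_unrolled_sort(FILE *out, int n)
{
    assert(out != NULL && n >= 0);
    int lines = 0;
    for (int k = 0; k < n - 1; ++k) {
        for (int l = k + 1; l < n; ++l) {
            fprintf(out, "swap_if_less(buf%03d, buf%03d);\n", k, l);
            ++lines;
        }
    }
    return lines;
}
```

Calling emit_unrolled_sort(stdout, 129) would print 129 * 128 / 2 = 8256 statements, so the size of the resulting kernel is a concern in its own right.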

MicahVillmow · ‎02-18-2009

Ryta,
Purely from a performance perspective, using the global buffer for this would be faster. The only difficult part is that you have to calculate your offsets yourself, as base_offset_for_data + data_size_per_thread * thread_id + index. Unlike our competition's, our local arrays do not reside in some special memory that is very close to the ALU cores. They reside in main memory, and the only thing the driver does is guarantee that no thread writes outside of its local space.

For example:
Ignoring my syntactic errors:
kernel void sort(float4<> input, float4<> output)
{
    // per-thread local array
    float4 local[129];
    local[0] = input[idx];
    local[1] = input[idx + 1];
    for (int x = 2; x < 129; ++x) {
        local[x] = local[x - 1] + local[x - 2];
    }
}

Will perform better if written as:
kernel void sort(float4<> input, float4<> output, float4[] local, int numdataperthread)
{
    // local is now a global buffer passed in by the caller;
    // each thread works in its own slice starting at lIdx
    int lIdx = numdataperthread * idx;
    local[lIdx] = input[idx];
    local[lIdx + 1] = input[idx + 1];
    for (int x = lIdx + 2; x < lIdx + 129; ++x) {
        local[x] = local[x - 1] + local[x - 2];
    }
}
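The offset arithmetic can be tried out on the host before porting. Below is a small C sketch (not Brook+ or CAL) that applies base_offset_for_data + data_size_per_thread * thread_id + index to a shared scratch buffer; global_index and fill_slice are hypothetical names, and thread_id is simply passed in as a parameter instead of coming from the runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Index of element `index` in the slice owned by `thread_id`, following
   base_offset_for_data + data_size_per_thread * thread_id + index. */
size_t global_index(size_t base_offset, size_t data_size_per_thread,
                    size_t thread_id, size_t index)
{
    assert(index < data_size_per_thread);  /* stay inside own slice */
    return base_offset + data_size_per_thread * thread_id + index;
}

/* Host-side stand-in for one kernel thread: fills its slice of the shared
   scratch buffer with the same recurrence as the kernel above. */
void fill_slice(float *scratch, size_t per_thread, size_t thread_id,
                float first, float second)
{
    assert(scratch != NULL && per_thread >= 2);
    scratch[global_index(0, per_thread, thread_id, 0)] = first;
    scratch[global_index(0, per_thread, thread_id, 1)] = second;
    for (size_t x = 2; x < per_thread; ++x) {
        scratch[global_index(0, per_thread, thread_id, x)] =
            scratch[global_index(0, per_thread, thread_id, x - 1)] +
            scratch[global_index(0, per_thread, thread_id, x - 2)];
    }
}
```

The bounds check in global_index does by hand what the driver is described as guaranteeing for local arrays: a thread never touches anything outside its own slice.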

oscarbarenys1 · ‎02-18-2009

Hi,

Will AMD Stream 1.4 contain the CAL OpenGL extension headers? And will some documentation be added to the programming guide, as there is for DirectX interop?

I ask because CAL reports that the CAL OpenGL extension is supported (at least on Catalyst 9.1, using Lavalys Everest 5.00).

If not, can somebody post an expected timeframe? At the very least, I hope it will be added to the OpenCL SDK, as that is layered on top of CAL and has OpenGL interactions.

rahulgarg · ‎02-18-2009

I am also interested in oscar's question.
