
WTrei
Journeyman III

Avoid using scatter if many values have to be returned

Hi,

First of all, sorry for my poor English. I hope you can understand me.

 

I am working on implementing a large-integer factoring method.

Now I have a problem with my output data: I expect factors of around 300 to 400 digits, so I have to return values of 1024 to 2048 bits. One uint4 can store 128 bits, so 8 normal output streams give me only 1024 bits of data, which might not be enough.
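As a rough check of the sizes mentioned above, here is a minimal sketch in plain C (host-side, not Brook+). The helper names `bits_for_digits` and `uint4_for_bits` are made up for illustration, and the bit estimate is an upper bound based on log2(10) < 3.322.

```c
#include <stddef.h>

/* Upper bound on the bits needed for a number with `digits` decimal digits.
   Assumes log2(10) < 3.322, so the estimate is never too small. */
static size_t bits_for_digits(size_t digits) {
    return (digits * 3322 + 999) / 1000;   /* ceil(digits * 3.322) */
}

/* Number of 128-bit uint4 values needed to hold `bits` bits. */
static size_t uint4_for_bits(size_t bits) {
    return (bits + 127) / 128;             /* ceil(bits / 128) */
}
```

By this estimate, 300 digits need at most 997 bits (8 uint4), 400 digits need at most 1329 bits (11 uint4), and a full 2048-bit value needs 16 uint4.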

I'm already using gather for my input data, so for performance reasons I don't want to use scatter for my outputs. Is there any other way to return more than 1024 bits per kernel? Another problem is memory use: I don't need to return the results of every kernel, because most of them don't find a real factor of my number. Any ideas how to return only the "interesting" results?

PS: I have to use local arrays while calculating. Is their performance similar to gather and scatter, or to normal variables?

Thanks for reading

0 Likes
5 Replies
Ceq
Journeyman III

- Option 1, use several output streams. If you need more than 8 float4 outputs, the compiler will automatically generate several kernel passes:

kernel void inc4(float4 i1<>, float4 i2<>, out float4 o1<>, out float4 o2<> ) {
    float4 t = {1.0f, 1.0f, 1.0f, 1.0f};
    o1 = i1 + t;
    o2 = i2 + t;
}

 

- Option 2, create a user-defined data type that can hold more values at once.

EDIT: It looks like this option won't work if you need more than 4 float8 values.

typedef struct float8_s {
    float4 f1;
    float4 f2;
} float8;

kernel void inc8(float8 i<>, out float8 o<> ) {
    float4 t = {1.0f, 1.0f, 1.0f, 1.0f};
    o.f1 = i.f1 + t;
    o.f2 = i.f2 + t;
}

 

- Option 3, I haven't tested this, but maybe you can use shared memory so that threads with different outputs can share their results.

 

-------------------------------------------------------

Notes to AMD:

1. It looks like using more than 4 user-defined float8 structures generates bad assembly, as the second pass is missing.

2. Both the inc4 and inc8 kernels generate the same assembly. However, if you use float2 instead of float4 as the base type, the generated assembly looks strange:

typedef struct float8b_s {
    float2 f1;
    float2 f2;
    float2 f3;
    float2 f4;
} float8b;

kernel void inc8b(float8b i<>, out float8b o<> ) {
    float2 t = {1.0f, 1.0f};
    o.f1 = i.f1 + t;
    o.f2 = i.f2 + t;
    o.f3 = i.f3 + t;
    o.f4 = i.f4 + t;
}

 

 

0 Likes

Currently, the Brook+ compiler doesn't pack float2 members into another basic data type such as float4. All the members of a struct are converted into multiple input or output streams of the same basic types defined in the structure.

So inc8 is emulated with 2 input and 2 output streams of type float4, and inc8b is emulated with 4 input and 4 output streams of type float2.

0 Likes

How do the kernel passes work?

I expect that calculating the data (~16 uint4 values) will need a few days (!) of computing time. It wouldn't be great if everything had to be calculated twice just because I can't return all the data in one pass.

On the other hand: if the pure calculation time is that long, would using scatter really affect my performance?

 

0 Likes
Ceq
Journeyman III

Thanks Gaurav. Even if they're emulated, it is sometimes better to use a user-defined float8 structure because it reads and returns consecutive data. Using two float4 values could require splitting the data into two float4 streams before processing and then merging it again.

--------------------

 

To WTrei: I'm not sure how they work, but if you use more than eight outputs the compiler issues a warning and generates the additional passes. You can check this using AMD Stream KernelAnalyzer.
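For a quick estimate of how many passes to expect, here is a minimal sketch in plain C (not Brook+). It assumes the only rule is the eight-outputs-per-pass limit mentioned in this thread; `passes_needed` and `MAX_OUTPUTS_PER_PASS` are hypothetical names, and the real compiler may split kernels differently.

```c
#include <stddef.h>

/* Assumption taken from this thread: at most 8 output streams per pass. */
#define MAX_OUTPUTS_PER_PASS 8

/* Number of kernel passes needed for `outputs` output streams,
   assuming outputs are simply packed 8 per pass. */
static size_t passes_needed(size_t outputs) {
    return (outputs + MAX_OUTPUTS_PER_PASS - 1) / MAX_OUTPUTS_PER_PASS;
}
```

Under that assumption, 16 uint4 outputs would take two passes and 8 or fewer would take one.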

I'm quite surprised that your individual threads take a few days. Note that large kernels with lots of loops and branches can result in bad GPU performance. Usually you should try to split your algorithm into several steps so that it fits the Brook+ stream programming model better.

About using scatter: if your kernels are that large you could give it a try, but remember to disable "GPU recover" in the driver; otherwise your kernels will be stopped after a few seconds.

0 Likes

Even if they're emulated, it is sometimes better to use a user-defined float8 structure because it reads and returns consecutive data. Using two float4 values could require splitting the data into two float4 streams before processing and then merging it again.


Brook+ has to do the same splitting and merging of data to emulate a structure using multiple streams. It's recommended to avoid structures if possible.
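To make the splitting and merging concrete, here is a minimal host-side sketch in plain C. It is only an illustration of the idea, not what the Brook+ runtime actually does internally; the `float4` and `float8` typedefs are stand-ins for the Brook+ types, and `split_float8` / `merge_float8` are hypothetical helpers.

```c
#include <stddef.h>

/* Stand-ins for the Brook+ types used earlier in the thread. */
typedef struct { float x, y, z, w; } float4;
typedef struct { float4 f1; float4 f2; } float8;

/* Split an array of float8 records into two separate float4 streams. */
static void split_float8(const float8 *in, float4 *s1, float4 *s2, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        s1[i] = in[i].f1;
        s2[i] = in[i].f2;
    }
}

/* Merge two float4 streams back into an array of float8 records. */
static void merge_float8(const float4 *s1, const float4 *s2, float8 *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i].f1 = s1[i];
        out[i].f2 = s2[i];
    }
}
```

Either way, each element is copied once on the way in and once on the way out, which is the overhead both replies are describing.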

0 Likes