8 Replies Latest reply on Jun 18, 2009 9:28 AM by Raistmer

    Stream with elements of user-defined type - how?

      When will these errors disappear?

      struct gpu_ap_signal{
          int time_series[64];
          int time_series_len;
          int peak_bin;
          float peak_power;
          int scale;
          double period;
          int ffa_scale;
          int n_client_bins;
      };//R: parts of ap_signal needed in GPU code
      struct gpu_ap_signals{
          int num_of_signals;
          struct gpu_ap_signal signal[30];
      };//R: this type will be used as element of output stream in kernel call

      kernel void GPU_FFA_kernel(float data[],int n_bins,float min_freq,out struct gpu_ap_signals s<>){

      ERROR--1: Stream element type not supported
      Statement: out struct gpu_ap_signals s<>

      float gpu_temp[4096];
      ERROR--3: Problem with Array variable declaration: Local Array not supported yet
      Statement: float gpu_temp[4096]

      No global variables at all, no local arrays, no structures of even moderate complexity...

      ERROR--7: Problem with call expression in kernel: kernel can't call a non-kernel
      Statement: int_log2(per_int) in max_coadd = int_log2(per_int)

      So, no callable functions? Is it at least possible to use macros?
        • Stream with elements of user-defined type - how?

          1. That structure is quite complex. You can't use arrays inside structures in the current version, and I also think mixing float and double types isn't supported. There is a very simple working example in "BROOK\samples\legacy\tests\struct".

          A workaround is to pass the structure members as individual kernel parameters. AMD people say that even if you use structures, the compiler transforms them into simple parameters. (However, if you need to use structures anyway, I think the Brook+ compiler does this faster than you could by packing and unpacking the data for the GPU yourself.)

          Note that if your kernel requires too many input/output streams, the compiler will split it into several passes and it will run slower.
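
          As a rough illustration of the workaround above, here is a host-side sketch in plain C (not Brook+ code). It assumes the kernel writes each scalar member to its own output stream and that each stream has already been read back into an ordinary array. The struct ap_signal_scalars and the helper gather_signals are hypothetical names made up for this example. Only four members are shown, to keep the stream count low, and the double member is left out because of the float/double doubt mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* A few scalar members of the original gpu_ap_signal. time_series[64] is
   left out because arrays inside structures are the unsupported part. */
struct ap_signal_scalars {
    int   peak_bin;
    float peak_power;
    int   scale;
    int   ffa_scale;
};

/* Hypothetical helper: rebuild one record per signal from per-member arrays.
   Each input array stands in for the data read back from one output stream. */
void gather_signals(const int *peak_bin, const float *peak_power,
                    const int *scale, const int *ffa_scale,
                    struct ap_signal_scalars *out, size_t n)
{
    size_t i;

    assert(n == 0 || (peak_bin && peak_power && scale && ffa_scale && out));
    for (i = 0; i < n; ++i) {
        out[i].peak_bin   = peak_bin[i];
        out[i].peak_power = peak_power[i];
        out[i].scale      = scale[i];
        out[i].ffa_scale  = ffa_scale[i];
    }
}
```

          The kernel itself would then take one "out" stream per member instead of a single stream of structures.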

          2. That's right, and I also think not having local arrays is a big restriction. They would be useful even if the compiler just unrolled arrays into simple GPU register operations. (By the way, I think that even if local arrays were supported, 4096 elements would be quite a lot to fit in the registers of a single thread.)

          3. Well, that isn't really a problem; it's just that you can't call CPU code from GPU kernels. You can still use functions and macros:

          - If you want to use those functions within GPU code, define them as kernels. This works as long as you don't use recursive calls. For example, "kernel int next(int i) { return i + 1; }" can be called normally from other kernels.

          - If you want to use macros, pass the "-pp" flag to the BRCC compiler to enable the preprocessor.
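
          To make the first point concrete, here is a minimal sketch of what a non-recursive int_log2 could look like. The original int_log2 is not shown in the thread, so this body is an assumption: it returns floor(log2(x)) for x >= 1 using a loop. It is written and checked as plain C. Following the "next" example above, the Brook+ version would presumably carry the "kernel" qualifier, and the assert would have to go because it is CPU code; I have not verified that brcc accepts the rest as is.

```c
#include <assert.h>

/* Assumed behaviour: floor(log2(x)) for x >= 1.
   Uses a loop instead of recursion, since recursive calls are
   the stated restriction for functions defined as kernels. */
int int_log2(int x)
{
    int result = 0;

    assert(x >= 1);          /* host-side check only */
    while (x > 1) {
        x = x / 2;           /* halve until we reach 1 */
        ++result;
    }
    return result;
}
```

          With a definition like this, the statement from the error message, max_coadd = int_log2(per_int), stays unchanged at the call site.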


            • Stream with elements of user-defined type - how?
              The fact that this structure example exists only in the "legacy" part of the samples is very alarming. The new sample set contains stream declarations only with basic data types (like float and float4)... A regression?

              I need all of this data as the result of the kernel's work.
              Is it possible to use 3D streams?
              Or is it possible to use a kernel like this:

              void kernel k(float s1[][], float s2[][], float s3[][], ..., float sN[][],
              out float o1[][], ..., out float oM[][], out float<>)
              (that is, both 2D scatter streams and a simple 1D stream as outputs, plus many gather streams as inputs)

              2) Actually, I wasn't thinking about an array of registers. I just need a way to allocate an array in GPU memory from inside the kernel. Surely it will be slower than an array of registers, but the number of registers is very limited... Is it possible to get access to GPU memory inside a kernel in some other way than declaring an input stream for it?
              I need some pretty big temporary buffers inside the kernel (that is, an amount of memory with read/write access that is bigger than the register set).

              Splitting this complex kernel into many simple kernels is not an option from a performance point of view. I have already gone down that road: the kernel setup overhead is too high for the many-simple-kernels approach to be useful :(