6 Replies Latest reply on Dec 21, 2009 10:37 PM by Stib

    new features in AMD OpenCL

    nou

      i found that now there is some support for extensions on GPU.

      double support for basic operators +-*/  thoug only on 58xx and 59xx series wonder if CPU is supported now.

      OpenGL/DX10 interoperability

      ICD support  nvidia and ati in the same system as two platforms.

      Stream Analyzer Profiler - hope there will be some tool to developer which do not use VS

      and new documentation like performance optimizations ISA for 5xxx card.

        • new features in AMD OpenCL
          Stib

          I tought, that you are speaking of extensions, like khr_byte_addressable_store, but i run my test program, and found nothing new there.

          I was hoping, because the new driver and so...maybe next time.

           

          EDIT: I think, i missed something. There is a NEW SDK out!

            • new features in AMD OpenCL
              Stib

              I found something interesting. There is a line in the Release Notes, page 3, 6.2 Runtime, it says:

              "On 7xx-based GPUs, the maximum work-group size is 64. Specifying a larger size can result
              in undefined behavior."

              I have an EHA4850, wich is an ATI RV770 GPU. I have an applikation, that runs with a work group of 230, and i have no problems there. Wich cards are meant then? The 3xxx series?

            • new features in AMD OpenCL
              MicahVillmow
              Stib,
              This is one of the many constraints of running OpenCL on the RV7XX based GPU's because the hardware was not created with OpenCL in mind. The problem in this specific instance is that barrier is not a hardware instruction and is a software based barrier. If you run a kernel with a group size larger than the wavefront size of the chip, then an implicit barrier is added to the end of every kernel. If you run the group size equal to the wavefront size, this barrier is optimized away. The barrier itself works in most situation, but not all, which is why the behavior is undefined.
                • new features in AMD OpenCL
                  Stib

                  I see. A tried now something. I used a work group size of 50 instead of 230, and i got not so much prformance loss, as i tought, in fact, i think its almost nothing. Is it because, i the implicit barrier is holding back the run of the other solution? Or is the barrier ONLY optimized away by a group size of 64?

                  Another question. How come, that the max work group size is 256 if i can only use 64 of it?

                • new features in AMD OpenCL
                  MicahVillmow
                  Stib,
                  Any size smaller than the wavefront size will optimize the barrier away. You can use what the runtime reports as correct, the wording in the doc is off and I'm attempting to get it corrected. It should be 'recommended' not 'maximum'. The maximum size is returned via the query CL_KERNEL_WORK_GROUP_SIZE.

                  edit: misspelled the ENUM value.
                    • new features in AMD OpenCL
                      Stib

                      Ok, becaus i got the 230 group size from the runtime. I did set the group size to NullRange, and so.

                      With 230 i get ~7,27 ms kernel time, and with 50 ~8,13 ms.

                      So, is it because the barrier, that there is so little between the two performances, or because the kernel does not so much work?

                      I attach my kernel if you'd like to see, its a simple masking algorithm, for pictures.

                      And, thank you, for the quick, and accurate answer!

                      __kernel void maszkol ( __global unsigned int *rOut, __global unsigned int *gOut, __global unsigned int *bOut, __global unsigned int *rMid, __global unsigned int *gMid, __global unsigned int *bMid, __global unsigned int *N, __global unsigned int *width, __global unsigned int *height, __global int *seged_t, __global float *maszk ) { unsigned int seged=*width*(*height); unsigned int hatar=*N*(*N); unsigned int cim=0; unsigned int i=0; float ertek=0; switch(get_global_id(0)/seged) { case 0: cim=((*N/2)*(*N-1+*width))+((get_global_id(0)/(*width))*(*N-1))+(*N/2)+get_global_id(0); for(i=0;i<hatar;++i) { ertek+=rMid[cim+seged_t[i]]*maszk[i]; } rOut[get_global_id(0)]=ertek; break; case 1: cim=((*N/2)*(*N-1+*width))+(((get_global_id(0)-seged)/(*width))*(*N-1))+(*N/2)+get_global_id(0)-seged; for(i=0;i<hatar;++i) { ertek+=gMid[cim+seged_t[i]]*maszk[i]; } gOut[get_global_id(0)-seged]=ertek; break; case 2: cim=((*N/2)*(*N-1+*width))+(((get_global_id(0)-(2*seged))/(*width))*(*N-1))+(*N/2)+get_global_id(0)-(2*seged); for(i=0;i<hatar;++i) { ertek+=bMid[cim+seged_t[i]]*maszk[i]; } bOut[get_global_id(0)-(2*seged)]=ertek; break; } }