8 Replies Latest reply on Oct 4, 2010 4:27 PM by tomknk

    Basic questions and graphic card problems


      Hi all,

      I am new to OpenCL and tried to implement some simple min/max algorithms. Because my Ubuntu had problems with the latest drivers, I worked with the CPU part of the Stream SDK 2.2 for two weeks, and my OpenCL code now runs fine on the CPU. I then found a workaround for the problems with the latest Catalyst driver and got the OpenCL capabilities of my graphics card back. I use an HD 4670, which is only OpenCL 1.0 compliant and has no atomic functions, so I removed all kernels that relied on atomics. The rest compiles correctly for the GPU, but I have some additional problems.

      1) I'm not sure whether I understand the OpenCL processing model wrong or only implemented things wrong. I thought that every work-group is cut into chunks of n work-items (n = number of compute units in the GPU), which are then processed at once, either until the end of the kernel is reached or until a barrier is hit. Then the next chunk follows, and so on, until the whole work-group (local size) is processed. Then the next work-group is processed the same way until all work-groups (global size) are executed.

      Is this correct, or are my expectations wrong?

      As I thought I was right, I implemented a kernel that uses a local buffer of 8 elements (as my graphics card has 8 compute units). I assumed each element would be used by only one of the 8 work-items processed at a time, so the buffer would be accessed without conflicts in each turn until the whole work-group is processed. So the following code should work?

      sample code:

      #define PROC_NUM 8

      /* "min" is a built-in function name in OpenCL C, so the kernel gets its own name */
      __kernel void min_kernel( __local int* buffer, __global const int* vector ) {
          size_t lid = get_local_id(0);

          if( vector[lid] > buffer[lid%PROC_NUM] )
              buffer[lid%PROC_NUM] = vector[lid];
      }



      buffer[] was initialised with the first element of vector.

      As I read in this forum, the execution order within a work-group is implementation dependent, so should I extend the buffer to hold a full warp, which is 64 (AMD) or 32 (NVIDIA) as I read in some paper (or the work-group size)? Or is it enough to use the number of compute units as the buffer size? Is the order in which the local IDs are processed specified in the OpenCL spec, or is it wrong to expect that each buffer item is used by only one work-item in every pass within the work-group?

      2) At first I implemented the above kernel with a global buffer, which doesn't work at all: the buffer elements are never updated, even if I use barrier(CLK_GLOBAL_MEM_FENCE); at the end of the code. Is there any way to use a global buffer that reflects the changing values, or is that only done between work-groups, or will that never be done?

      3) CLInfo states that my graphics card has a max work-group size of 128, but whenever I use a local buffer, even one the size of a single unsigned int, the kernel fails without an error when the work-group size is larger than 32. Only the short execution time and the global debug buffer I use to read back some debug variables showed that the kernel is not executed at all (in the simplest test I used debug[0] = get_global_id(0); and it always stays at -1, the initialisation value). That's why I can't test the code sample in 1) with 64, which could be necessary to execute the min kernel correctly.

      So I hope someone with more experience and knowledge can answer some of these questions or point me in the right direction.



        • Basic questions and graphic card problems

          As far as I know:

          1) Yes; however, on a GPU there are usually many work-groups running simultaneously.

          2) Local memory is shared by the whole work-group, and you can't initialize it from the CPU code. So __local int* buffer won't contain any data, and the comparison (vector[lid] > buffer[lid%PROC_NUM]) is a bit of nonsense.

          get_local_id(0) will return a value in (0, PROC_NUM), so %PROC_NUM is useless.

          get_global_id(0) will return a value in (0, NUM_THREADS) and usually serves to access global buffers.

          The barrier at the end of the kernel is useless.
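The points above (local memory has to be filled by the work-items themselves, and a barrier only matters between phases) can be sketched in plain C. This is only an emulation under stated assumptions, not the poster's kernel: `GROUP_SIZE` and `workgroup_min` are hypothetical names, each loop over `lid` stands for all work-items running one phase, and the end of each such loop marks where `barrier(CLK_LOCAL_MEM_FENCE)` would sit in a real kernel.

```c
#include <stddef.h>

#define GROUP_SIZE 8  /* hypothetical work-group size; must be a power of two */

/* Plain-C emulation of one work-group computing the minimum of its
 * GROUP_SIZE input elements. */
int workgroup_min(const int *vector)
{
    int buffer[GROUP_SIZE];  /* plays the role of __local memory */

    /* Phase 1: every work-item fills its own slot from global memory,
     * because the host cannot initialise local memory. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        buffer[lid] = vector[lid];
    /* barrier(CLK_LOCAL_MEM_FENCE) would go here */

    /* Phase 2: tree reduction; the number of active work-items halves
     * in every step, and each one touches only its own slot pair. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid)
            if (buffer[lid + stride] < buffer[lid])
                buffer[lid] = buffer[lid + stride];
        /* barrier(CLK_LOCAL_MEM_FENCE) would go here */
    }

    return buffer[0];  /* work-item 0 would write this to global memory */
}
```

Within one reduction step, a work-item reads slot `lid + stride` and writes slot `lid`, so no two work-items touch the same slot, whatever order the hardware runs them in.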

          Hope I helped a bit to clear up the things you pointed out. If you have more questions or I was unclear, ask again ;-)

            • Basic questions and graphic card problems

              hi tomknk,

              I am sorry to say it, but you are still not clear about the ATI GPU architecture.

              Please read the first chapter of the OpenCL Programming Guide available on the AMD site.

              Let me tell you some facts:

              Each compute unit in your GPU is able to run at least one work-group (which cannot be more than 128 threads in your case). So at any time there would be 128*8 work-items, or 8 work-groups, executing concurrently.

              In addition, all the suggestions provided by karbous are correct to the best of my knowledge.


                • Basic questions and graphic card problems

                  Thanks for your fast replies and for clearing up some of my thoughts. For 1), I didn't have the source code at hand and copied an old version that used global memory instead of local memory and was therefore initialised. So I am sorry for the wrong code.

                  Just to understand the basic concept of how the workload is processed, let's take an example:

                  Say I have a kernel with a global size of 256, which means 256 threads of the kernel will be executed, and a work-group size of 64, which means I have 4 work-groups with 64 work-items each. As I read in some OpenCL tutorial, I thought that the workload (threads) is executed in warps, which, as mentioned earlier, are 32 (NVIDIA) and 64 (ATI) wide. So for ATI, 64 work-items will be processed in one 'concurrent' pass, and since my GPU only has 8 compute units, it takes 4 (or multiples of this) cycles to calculate the whole work-group.

                  But as I understand your comment, it is the other way around. I have 4 work-groups and my GPU has 8 compute units, so the first two work-items of each of the 4 work-groups will be executed concurrently, then the next two, and so on until all work-items are processed. Am I correct?

                  Or are both possible, and neither of the two can be expected, since this is up to the graphics card vendor and not specified in the OpenCL spec?

                  I will read the AMD docs, since you are right that I am missing some basic understanding or clarification. I am a bit confused, because every time I thought I understood something and it worked in some simple test kernel, it was only a coincidence that it worked as expected.


                  Could anybody also help me investigate the problem described in point 3? Maybe I can post some simple code tomorrow to reproduce it or get some hints, once I have access to my test code.


                    • Basic questions and graphic card problems

                      All work-items of a work-group must run simultaneously, or, if the hardware can't manage that, it must process e.g. half of the work-group first and the other half afterwards (or switch between the halves when a barrier is hit), and so on.

                      To question 3): Check how much local memory your graphics card has. It is possible that your code doesn't catch the CL_OUT_OF_RESOURCES error.
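A kernel that "fails without an error" often means a status code was returned but never inspected. The following is a minimal sketch of such a check, written so that it compiles without the OpenCL SDK: the numeric values are the ones defined in `cl.h`, while `cl_status_name` and `cl_check` are hypothetical helper names, not part of the OpenCL API.

```c
#include <stdio.h>

/* Values as defined in the OpenCL header cl.h; repeated here only so
 * the sketch builds on its own. With cl.h included they are skipped. */
#ifndef CL_SUCCESS
#define CL_SUCCESS                          0
#define CL_MEM_OBJECT_ALLOCATION_FAILURE   -4
#define CL_OUT_OF_RESOURCES                -5
#define CL_OUT_OF_HOST_MEMORY              -6
#define CL_INVALID_WORK_GROUP_SIZE        -54
#endif

/* Hypothetical helper: names a few status codes relevant to launches. */
const char *cl_status_name(int status)
{
    switch (status) {
    case CL_SUCCESS:                       return "CL_SUCCESS";
    case CL_MEM_OBJECT_ALLOCATION_FAILURE: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case CL_OUT_OF_RESOURCES:              return "CL_OUT_OF_RESOURCES";
    case CL_OUT_OF_HOST_MEMORY:            return "CL_OUT_OF_HOST_MEMORY";
    case CL_INVALID_WORK_GROUP_SIZE:       return "CL_INVALID_WORK_GROUP_SIZE";
    default:                               return "other OpenCL error";
    }
}

/* Hypothetical helper: returns 1 on success; otherwise reports which
 * call failed on stderr and returns 0. */
int cl_check(int status, const char *what)
{
    if (status == CL_SUCCESS)
        return 1;
    fprintf(stderr, "%s failed: %s (%d)\n", what, cl_status_name(status), status);
    return 0;
}
```

In host code, the value returned by `clEnqueueNDRangeKernel` and the one returned by `clFinish` would both be passed through `cl_check`, since a launch problem can surface at either call.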

                        • Basic questions and graphic card problems

                          Each work-group will be executed on one compute unit, at least on ATI GPUs. So for your example, 256 work-items with a local size of 64, it will execute four work-groups on four compute units (so you get half the performance of your card). Each compute unit can execute 16 work-items in parallel, so 64 work-items will be executed as one wavefront, which takes four clocks on the 16 units.
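The arithmetic in this reply can be written down as a small C sketch. The figures of 8 compute units and 16 work-items per clock are taken from the posts in this thread and treated as assumptions, and all function names are hypothetical.

```c
#include <stddef.h>

/* Figures quoted in this thread, taken as assumptions. */
#define COMPUTE_UNITS  8   /* compute units on the card */
#define LANES_PER_CU  16   /* work-items one compute unit runs per clock */

/* Number of work-groups in a 1D range; global_size is expected to be
 * a multiple of local_size. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}

/* Compute units kept busy when each work-group occupies one unit. */
size_t busy_compute_units(size_t groups)
{
    return groups < COMPUTE_UNITS ? groups : COMPUTE_UNITS;
}

/* Clocks needed to issue one instruction for a whole work-group on a
 * single compute unit, rounded up. */
size_t clocks_per_instruction(size_t local_size)
{
    return (local_size + LANES_PER_CU - 1) / LANES_PER_CU;
}
```

For a global size of 256 and a local size of 64 this gives 4 work-groups, 4 of 8 compute units busy, and 4 clocks per instruction, which matches the "half performance" remark.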

                            • Basic questions and graphic card problems

                              I would recommend you to check the reductions case study:


                              If you feel this is too hard to start with, you should look at more basic concepts first...

                                • Basic questions and graphic card problems

                                  Please read the first chapter of the OpenCL Programming Guide.

                                  Some more facts:

                                  Your GPU contains 8 compute units (CU).

                                  Each CU has 16 stream cores, each capable of running one work-item at a time. So 16 work-items (a quarter-wavefront) are executed concurrently.

                                  To hide instruction latencies (delays), four quarter-wavefronts (16*4 work-items) are grouped together, forming a complete wavefront.

                                  Furthermore, each stream core has 5 processing elements, which look for independent instructions inside the same work-item to execute them in parallel.

                                  In all, your GPU can execute 8*128 work-items in one go. The 5 processing elements might also be executing their instructions in parallel.

                                  It's too much to cover in one post, so please read the OpenCL Programming Guide.

                                    • Basic questions and graphic card problems

                                      Thanks for all your answers and suggestions. I have already adopted the reduction sample douglas125 mentioned and have also built some code myself, which works without problems. But as all of you have already recognized, some of my basic understanding is wrong or incomplete, so I have started to read the OpenCL Programming Guide and hope that my knowledge will soon rise to a level where I fully understand why some of the kernels I have developed work and some don't.

                                      I will investigate my second problem after I have read the programming guide, and I will reply once I have found the problem.