21 Replies Latest reply on Jan 2, 2010 8:59 AM by nou

    Basic opencl queries...

    mohit2710

      Hi,

      I was trying to understand the matrix multiplication example in the SDK samples, and I have a couple of questions.

      1. My laptop has an Intel Core 2 Duo processor with two cores (compute units). Is there a way to control how many cores are used? For example, what if I want to use only one core and not the second?

      2. The example measures several computation and setup times, but they are not shown in the output. How can I get those times printed?

      Any kind of help is appreciated.

       

        • Basic opencl queries...
          nou

          You can set the environment variable CPU_MAX_COMPUTE_UNITS=n.

          • Basic opencl queries...
            genaganna

             

            Originally posted by: mohit2710

            "1. My laptop has an Intel Core 2 Duo processor with two cores (compute units). Is there a way to use only one of the cores and not the second?"

            Set the environment variable CPU_MAX_COMPUTE_UNITS to the number of cores you want to use.

            "2. The example measures several computation and setup times, but they are not shown in the output. How can I get those times printed?"

            There is a document in the doc folder for each sample that describes all the options available for that particular sample.

              • Basic opencl queries...
                mohit2710

                Hi,

                I ran the matrix multiplication code for two 1024x1024 matrices in two configurations on my Intel Core 2 Duo T6400 @ 2.00 GHz processor.

                In the first case I set the number of compute units to 2, and the time came out to be 35.6 sec.

                In the second case I set the number of compute units to 1, and the time came out to be 38 sec.

                What do these results indicate?

                Shouldn't the time taken be double in the second case?

                  • Basic opencl queries...
                    genaganna

                     


                    Mohit2710,

                          Please run with bigger matrices.

                    I am getting the following on my Phenom quad-core for 2048 x 2048:

                          1. CPU_MAX_COMPUTE_UNITS=1: 202.607 sec

                          2. CPU_MAX_COMPUTE_UNITS=2: 109.014 sec

                     

                    The reported kernel time also includes the ReadBuffer call. To measure the kernel alone, time only the kernel execution (clEnqueueNDRangeKernel).
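A minimal sketch of the arithmetic behind that advice, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and that the start and end counters were read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) from the events returned by clEnqueueNDRangeKernel and clEnqueueReadBuffer. Those counters are reported in nanoseconds. The OpenCL calls themselves are left out so the block compiles without the SDK; `timestamp_ns` and the function names are placeholders and not part of the sample.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for cl_ulong, the type OpenCL uses for profiling counters. */
typedef unsigned long long timestamp_ns;

/* Convert a pair of nanosecond profiling counters to elapsed seconds. */
double elapsed_seconds(timestamp_ns start, timestamp_ns end)
{
    assert(end >= start);
    return (double)(end - start) * 1e-9;
}

/* Report kernel time and read-back time separately, so the kernel
   figure is not inflated by the buffer read. */
void report_times(timestamp_ns kernel_start, timestamp_ns kernel_end,
                  timestamp_ns read_start, timestamp_ns read_end)
{
    double kernel = elapsed_seconds(kernel_start, kernel_end);
    double read = elapsed_seconds(read_start, read_end);

    printf("kernel only : %.6f s\n", kernel);
    printf("read buffer : %.6f s\n", read);
    printf("combined    : %.6f s\n", kernel + read);
}
```

When comparing runs with different CPU_MAX_COMPUTE_UNITS values, the "kernel only" figure is the one to compare.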

                     

                    Please close other applications before running this.

                      • Basic opencl queries...
                        mohit2710

                        Hi,

                        I am using Ubuntu 9.04 inside VMware.

                        My host operating system is XP.

                        I have tried to change the CPU_MAX_COMPUTE_UNITS variable, but the result does not change.

                        If I type 'env' in the terminal, it doesn't show any such variable.

                        Anyway, I typed 'export CPU_MAX_COMPUTE_UNITS=2' (or 1) to set the variable, but the timing does not change.

                        Am I doing something wrong?

                        Can you tell me exactly how to set this environment variable?

                         

                          • Basic opencl queries...
                            genaganna

                            The way you are setting the environment variable is right, but the variable should appear when you run the env command. I am not sure why it is not showing in the list.

                             

                            Write a simple C program that reads the environment variable and prints its value.
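A minimal sketch of such a check. `getenv` is standard C; the function name `report_env_var` is only illustrative.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Print NAME=value as seen by this process, or say that it is unset.
   Returns 1 if the variable is set, 0 otherwise. */
int report_env_var(const char *name)
{
    assert(name != NULL);

    const char *value = getenv(name);
    if (value == NULL) {
        printf("%s is not set in this process\n", name);
        return 0;
    }
    printf("%s=%s\n", name, value);
    return 1;
}
```

Call `report_env_var("CPU_MAX_COMPUTE_UNITS")` from `main` and run the program from the same terminal in which the `export` was done, since an exported variable is only inherited by processes started from that shell.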

                              • Basic opencl queries...
                                mohit2710

                                If I type 'export CPU_MAX_COMPUTE_UNITS=2' followed by 'env', then it does show in the list.

                                But my problem is that no matter how many compute units I select through the environment variable, the timing remains the same.

                                The machine I am using has an Intel Xeon CPU E5405 @ 2.00 GHz, a quad-core processor.

                                 

                                  • Basic opencl queries...
                                    genaganna

                                    mohit2710,

                                    Could you please install the latest OpenCL SDK and run the CLInfo sample?

                                    The CLInfo sample displays information about the devices available on your system. Its output contains a "Max compute units" field. Please let me know the value of that field.

                                    You will find the latest OpenCL SDK at http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx.

                                    I am not sure what the problem is on your system.

                                     

                                     

                                     

                                      • Basic opencl queries...
                                        mohit2710

                                        When compiling CLInfo I get an error that gl.h is not found.

                                        This file is to be provided by ATI in its SDK which is not

                                          • Basic opencl queries...
                                            genaganna

                                             


                                             

                                            Please set ATISTREAMSDKSAMPLESROOT to your installed directory.

                                            Please read ATI_Stream_SDK_Getting_Started_Guide_v2.0.pdf available at http://developer.amd.com/gpu/ATIStreamSDK/pages/Documentation.aspx

                                             

                                             

                                            • Basic opencl queries...
                                              mohit2710

                                              I was able to run the CLInfo sample. It shows "Max compute units" equal to 1, but my processor is a quad-core processor.

                                                • Basic opencl queries...
                                                  genaganna

                                                  Mohit2710,

                                                  I am not sure why OpenCL is seeing only one compute unit on your system.

                                                  Could you please post more details about your system and your VMware setup?

                                                   

                                                    • Basic opencl queries...
                                                      mohit2710

                                                       

                                                      This is the complete output of the CLInfo file....

                                                      I am using VMware Player 3.0.

                                                      What do you suggest after seeing this output?

                                                      Number of platforms:                 1
                                                        Plaform Profile:                 FULL_PROFILE
                                                        Plaform Version:                 OpenCL 1.0 ATI-Stream-v2.0-beta4
                                                        Plaform Name:                     ATI Stream
                                                        Plaform Vendor:                 Advanced Micro Devices, Inc.


                                                        Plaform Name:                     ATI Stream
                                                      Number of devices:                 1
                                                        Device Type:                     CL_DEVICE_TYPE_CPU
                                                        Device ID:                     4098
                                                        Max compute units:                 1
                                                        Max work items dimensions:             3
                                                          Max work items[0]:                 1024
                                                          Max work items[1]:                 1024
                                                          Max work items[2]:                 1024
                                                        Max work group size:                 1024
                                                        Preferred vector width char:             16
                                                        Preferred vector width short:             8
                                                        Preferred vector width int:             4
                                                        Preferred vector width long:             2
                                                        Preferred vector width float:             4
                                                        Preferred vector width double:         0
                                                        Max clock frequency:                 1995Mhz
                                                        Address bits:                     32
                                                        Max memeory allocation:             536870912
                                                        Image support:                 No
                                                        Max size of kernel argument:             4096
                                                        Alignment (bits) of base address:         1024
                                                        Minimum alignment (bytes) for any datatype:     128
                                                        Single precision floating point capability
                                                          Denorms:                     Yes
                                                          Quiet NaNs:                     Yes
                                                          Round to nearest even:             Yes
                                                          Round to zero:                 No
                                                          Round to +ve and infinity:             No
                                                          IEEE754-2008 fused multiply-add:         No
                                                        Cache type:                     Read/Write
                                                        Cache line size:                 64
                                                        Cache size:                     65536
                                                        Global memory size:                 1073741824
                                                        Constant buffer size:                 65536
                                                        Max number of constant args:             8
                                                        Local memory type:                 Global
                                                        Local memory size:                 32768
                                                        Profiling timer resolution:             1
                                                        Device endianess:                 Little
                                                        Available:                     Yes
                                                        Compiler available:                 Yes
                                                        Execution capabilities:                 
                                                          Execute OpenCL kernels:             Yes
                                                          Execute native function:             No
                                                        Queue properties:                 
                                                          Out-of-Order:                 No
                                                          Profiling :                     Yes
                                                        Platform ID:                     0
                                                        Name:                         Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
                                                        Vendor:                     GenuineIntel
                                                        Driver version:                 1.0
                                                        Profile:                     FULL_PROFILE
                                                        Version:                     OpenCL 1.0 ATI-Stream-v2.0-beta4
                                                        Extensions:                     cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store

                                                       

                                                      VMWare info:

                                                      cat /proc/cpuinfo
                                                      processor    : 0
                                                      vendor_id    : GenuineIntel
                                                      cpu family    : 6
                                                      model        : 23
                                                      model name    : Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
                                                      stepping    : 10
                                                      cpu MHz        : 1995.040
                                                      cache size    : 6144 KB
                                                      fdiv_bug    : no
                                                      hlt_bug        : no
                                                      f00f_bug    : no
                                                      coma_bug    : no
                                                      fpu        : yes
                                                      fpu_exception    : yes
                                                      cpuid level    : 13
                                                      wp        : yes
                                                      flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss nx constant_tsc up arch_perfmon pebs bts xtopology tsc_reliable pni ssse3 sse4_1 hypervisor
                                                      bogomips    : 3990.08
                                                      clflush size    : 64
                                                      power management:

                                                        • Basic opencl queries...
                                                          nou

                                                          Well, the problem is VMware. I do not use VMware, but in VirtualBox I can set the number of cores that a virtual machine uses, so I think VMware has created a single-core virtual system.

                                                            • Basic opencl queries...
                                                              genaganna

                                                              Mohit2710,

                                                              It seems you have created a virtual system with one core.
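One way to check what the guest operating system actually exposes, as a sketch: `sysconf(_SC_NPROCESSORS_ONLN)` is available on Linux (glibc), although POSIX does not require it. The function names are illustrative. Whether the OpenCL runtime derives "Max compute units" from this same count is an assumption here; the CLInfo output and the single `processor : 0` entry in /proc/cpuinfo posted above both point to a single visible processor.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Number of processors currently online as seen by this OS
   (the guest, when run inside a virtual machine).
   sysconf returns -1 if the value cannot be determined. */
long online_processor_count(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Returns 1 if at least `wanted` processors are visible, 0 otherwise.
   Useful before choosing a value for CPU_MAX_COMPUTE_UNITS. */
int has_online_processors(long wanted)
{
    assert(wanted > 0);
    return online_processor_count() >= wanted;
}

/* Print the count in a form that is easy to compare with CLInfo. */
void print_online_processor_count(void)
{
    printf("online processors: %ld\n", online_processor_count());
}
```

If this prints 1 inside the virtual machine, asking for more compute units through the environment variable cannot help until more cores are assigned to the virtual machine.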

                                                                • Basic opencl queries...
                                                                  mohit2710

                                                                  Thanks for your help. It was indeed a VMware problem. That has been solved now, but the time is still not proportional to the number of cores used.

                                                                  The results for two 2048x2048 input matrices with a block size of 32 are:

                                                                  Compute units = 1     time = 267 s

                                                                  Compute units = 2     time = 160 s

                                                                   

                                                                  Is this difference considerable enough?
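To put numbers on that question, here is a small sketch that computes speedup and parallel efficiency from the timings quoted in this thread. The function names are illustrative. Speedup is the one-unit time divided by the two-unit time, and efficiency is the speedup divided by the number of units, so 2.0 and 100% would be ideal scaling on two units.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of a run on several compute units relative to the one-unit run. */
double speedup(double seconds_one_unit, double seconds_n_units)
{
    assert(seconds_one_unit > 0.0 && seconds_n_units > 0.0);
    return seconds_one_unit / seconds_n_units;
}

/* Parallel efficiency: speedup divided by the number of units (1.0 is ideal). */
double efficiency(double seconds_one_unit, double seconds_n_units, int units)
{
    assert(units > 0);
    return speedup(seconds_one_unit, seconds_n_units) / units;
}

/* Print one line of the comparison for a pair of timings. */
void print_scaling(const char *label, double t1, double tn, int units)
{
    printf("%-24s speedup %.2f, efficiency %.0f%%\n",
           label, speedup(t1, tn), 100.0 * efficiency(t1, tn, units));
}

/* Timings quoted in this thread, each comparing 1 and 2 compute units. */
void print_thread_timings(void)
{
    print_scaling("1024x1024 (first run)", 38.0, 35.6, 2);
    print_scaling("Phenom, 2048x2048", 202.607, 109.014, 2);
    print_scaling("Xeon E5405, 2048x2048", 267.0, 160.0, 2);
}
```

For the 267 s and 160 s pair this gives a speedup of about 1.67 and an efficiency of about 83%, compared with about 1.86 for the Phenom figures and about 1.07 for the first 1024x1024 run.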

                                                                  Another question I have is how the threads are deployed to the hardware. Is it the same concept as pthreads, or is there a difference here?