7 Replies Latest reply on Apr 25, 2011 8:51 AM by diepchess

    clGetDeviceInfo

    diepchess
      Bug reports and/or questions/requests

      Good Morning!

       

      I printed some device information from the system with OpenCL, and some of the reported data raises questions.

       

      It would be great to get some answers; feel free to split them over multiple postings or subjects. Apologies for posting it all at once. If you'd prefer more than one posting, let me know...

      See the text attached below. Please cut and paste it into a fixed-width font to see it better formatted.

       

      Question 1: It reports that the GPU has 1GB of RAM and the CPU has 10GB of RAM. The 10GB for the quad-socket box is correct. Yet I bought an XFX 6970 with 2GB. The label on the box I bought says: HD 6970 880M 2GB ddr5 dual dp hdmi dual dvi pci-e.

      Let's start with the most likely possibility: OpenCL reports the GPU device RAM incorrectly.

       

      Question 2: I bought a 2GB device in order to use that RAM, and 2GB is really little nowadays, especially with 1536 stream cores. What amazes me is that, of the amount of RAM it finds, it allows a single object to use just 25%.

        a) Can that be raised to the amount of RAM the device has?

        b) Why this strange 25% limit? Suppose you buy a Formula 1 car with 950
            horsepower and you can use just 240 horsepower. Good deal? Or is your
            next car then an Nvidia? It makes no sense to limit this.
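
      The 25% figure can be checked against the sizes in the dump attached below. This is only a small arithmetic sketch using the reported values; the variable names are made up for illustration. As far as the OpenCL 1.1 specification goes, a quarter of the global memory size (or 128MB, whichever is larger) is the minimum an implementation must allow for CL_DEVICE_MAX_MEM_ALLOC_SIZE, not an upper bound.

```python
# Values copied from the attached dump (GPU = device 0, CPU = device 1).
gpu_max_alloc = 268435456      # CL_DEVICE_MAX_MEM_ALLOC_SIZE, GPU
gpu_global_mem = 1073741824    # CL_DEVICE_GLOBAL_MEM_SIZE, GPU
cpu_max_alloc = 2626032640     # CL_DEVICE_MAX_MEM_ALLOC_SIZE, CPU
cpu_global_mem = 10504130560   # CL_DEVICE_GLOBAL_MEM_SIZE, CPU

MIB = 1024 * 1024

# Fraction of the reported global memory that a single object may use.
gpu_ratio = gpu_max_alloc / gpu_global_mem
cpu_ratio = cpu_max_alloc / cpu_global_mem

print(f"GPU: {gpu_max_alloc // MIB} MiB of {gpu_global_mem // MIB} MiB ({gpu_ratio:.0%})")
print(f"CPU: {cpu_max_alloc // MIB} MiB of {cpu_global_mem // MIB} MiB ({cpu_ratio:.0%})")
```

      Both devices come out at exactly one quarter, so the limit appears to be applied uniformly by the runtime rather than being specific to the GPU.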

       

      Question 3: It correctly reports that the GPU has 24 compute cores. Which other information can I use to calculate that I have 1536 PEs available on the GPU?

      How do I accomplish that: with which function call, setting or mathematical formula? I scanned the entire document opencl-1.1-rev33.pdf but couldn't figure it out.

      Please enlighten me.

      Note that it does correctly report the number of cores of the 16-core Opteron 8356 box.

       

      Question 4: For the GPU it returns CL_DEVICE_MAX_CLOCK_FREQUENCY = 0,

      whereas for the CPU it correctly reports 2310 (MHz). How do I figure out the frequency setting of the GPU?

       

      Question 5: It reports CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = 0,

      yet I thought the 6000 series has a Global Data Store. Can you enlighten me there?

       

      There are lots of things wrong in the reporting of the CPU settings.

      Question 6: It correctly doesn't show the GPU as out of order. Yet it doesn't show it for the CPU either. I hope I interpreted this correctly, as the command queue is covered under 'execution model', chapter 5. For the CPU the field reads

      CL_DEVICE_QUEUE_PROPERTIES = CL_QUEUE_PROFILING_ENABLE

      with no bit indicating that CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is set.

       

      Question 7: It doesn't report the Opteron box as having ECC capabilities in the CL_DEVICE_ERROR_CORRECTION_SUPPORT setting.

       

      Question 8: I really don't understand the numbers it prints for the CPU for all sorts of things like global_mem_cache_size. It reports 64KB. The SRAM* between the Opteron CPU and the RAM is really a lot more than that. Something like 4MB for all 4 CPUs together or so?

      *Please note: SRAM on CPUs == L3 cache

       

      Question 9:

      It reports the local mem size as 32KB, yet the CPU has a 64KB L1 data cache for each core, so it could easily report 64KB there as well.

       

      Question 10 is again about the GPU. How do I figure out that it's an XFX? The Linux command 'lspci -v' somehow seems to know what sort of video card it is, but in OpenCL I don't see that text anywhere. All I get back is that it is a Cayman.

      Please enlighten me. Note that lspci -v shows the video card as having 256MB of RAM, which is wrong as well. Would this impact video performance during some tests at some websites? That would be very bad news for AMD of course,

      as the system guessing it can use 256MB when the card has 2GB is quite a bad idea. Anyone?

      Thanks for bearing with me this far,

      I'd argue that's enough for now.

      Vincent

      diep@xs4all.nl

      skype: diepchess

      Number of Platforms found : 1
      PROFILE = FULL_PROFILE
      VERSION = OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
      NAME = AMD Accelerated Parallel Processing
      VENDOR = Advanced Micro Devices, Inc.
      EXTENSIONS = cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
      Number of devices found (and added) 2 at platform 0

      Querying device = 0
      DEVICETYPE = GPU
      CL_DEVICE_NAME = Cayman
      CL_DEVICE_VENDOR = Advanced Micro Devices, Inc.
      CL_DRIVER_VERSION = CAL 1.4.900
      CL_DEVICE_PROFILE = FULL_PROFILE
      CL_DEVICE_VERSION = OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
      CL_DEVICE_OPENCL_C_VERSION = OpenCL C 1.1
      CL_DEVICE_EXTENSIONS = cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt
      CL_DEVICE_GLOBAL_MEM_CACHE_TYPE = NONE
      CL_DEVICE_LOCAL_MEM_TYPE = LOCAL MEMORY (SRAM OR DEDICATED)
      CL_DEVICE_EXECUTION_CAPABILITIES = CL_EXEC_KERNEL
      CL_DEVICE_EXECUTION_CAPABILITIES = CL_QUEUE_PROFILING_ENABLE
      CL_DEVICE_MAX_MEM_ALLOC_SIZE = 268435456
      CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = 0
      CL_DEVICE_GLOBAL_MEM_SIZE = 1073741824
      CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = 65536
      CL_DEVICE_LOCAL_MEM_SIZE = 32768
      CL_DEVICE_VENDOR_ID = 4098
      CL_DEVICE_MAX_COMPUTE_UNITS = 24
      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR = 16
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT = 8
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT = 4
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG = 2
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT = 4
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE = 0
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF = 0
      CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR = 16
      CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT = 8
      CL_DEVICE_NATIVE_VECTOR_WIDTH_INT = 4
      CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG = 2
      CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT = 4
      CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE = 0
      CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF = 0
      CL_DEVICE_MAX_CLOCK_FREQUENCY = 0
      CL_DEVICE_ADDRESS_BITS = 32
      CL_DEVICE_MAX_SAMPLERS = 16
      CL_DEVICE_MEM_BASE_ADDR_ALIGN = 32768
      CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE = 128
      CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE = 0
      CL_DEVICE_MAX_CONSTANT_ARGS = 8
      CL_DEVICE_MAX_WORK_ITEM_SIZES = (256,256,256)
      CL_DEVICE_MAX_WORK_GROUP_SIZE = 256
      CL_DEVICE_MAX_PARAMETER_SIZE = 1024
      CL_DEVICE_PROFILING_TIMER_RESOLUTION = 1
      CL_DEVICE_ERROR_CORRECTION_SUPPORT = FALSE
      CL_DEVICE_HOST_UNIFIED_MEMORY = FALSE
      CL_DEVICE_ENDIAN_LITTLE = TRUE
      CL_DEVICE_AVAILABLE = TRUE
      CL_DEVICE_COMPILER_AVAILABLE = TRUE

      Querying device = 1
      DEVICETYPE = CPU
      CL_DEVICE_NAME = Quad-Core AMD Opteron(tm) Processor 8356
      CL_DEVICE_VENDOR = AuthenticAMD
      CL_DRIVER_VERSION = 2.0
      CL_DEVICE_PROFILE = FULL_PROFILE
      CL_DEVICE_VERSION = OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
      CL_DEVICE_OPENCL_C_VERSION = OpenCL C 1.1
      CL_DEVICE_EXTENSIONS = cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_media_ops cl_amd_popcnt cl_amd_printf
      CL_DEVICE_GLOBAL_MEM_CACHE_TYPE = READ AND WRITE
      CL_DEVICE_LOCAL_MEM_TYPE = GLOBAL MEMORY
      CL_DEVICE_EXECUTION_CAPABILITIES = CL_EXEC_KERNEL | CL_EXEC_NATIVE_KERNEL
      CL_DEVICE_EXECUTION_CAPABILITIES = CL_QUEUE_PROFILING_ENABLE
      CL_DEVICE_MAX_MEM_ALLOC_SIZE = 2626032640
      CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = 65536
      CL_DEVICE_GLOBAL_MEM_SIZE = 10504130560
      CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = 65536
      CL_DEVICE_LOCAL_MEM_SIZE = 32768
      CL_DEVICE_VENDOR_ID = 4098
      CL_DEVICE_MAX_COMPUTE_UNITS = 16
      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR = 16
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT = 8
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT = 4
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG = 2
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT = 4
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE = 0
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF = 0
      CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR = 16
      CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT = 8
      CL_DEVICE_NATIVE_VECTOR_WIDTH_INT = 4
      CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG = 2
      CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT = 4
      CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE = 0
      CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF = 0
      CL_DEVICE_MAX_CLOCK_FREQUENCY = 2310
      CL_DEVICE_ADDRESS_BITS = 64
      CL_DEVICE_MAX_SAMPLERS = 16
      CL_DEVICE_MEM_BASE_ADDR_ALIGN = 1024
      CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE = 128
      CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE = 64
      CL_DEVICE_MAX_CONSTANT_ARGS = 8
      CL_DEVICE_MAX_WORK_ITEM_SIZES = (1024,1024,1024)
      CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
      CL_DEVICE_MAX_PARAMETER_SIZE = 4096
      CL_DEVICE_PROFILING_TIMER_RESOLUTION = 1
      CL_DEVICE_ERROR_CORRECTION_SUPPORT = FALSE
      CL_DEVICE_HOST_UNIFIED_MEMORY = TRUE
      CL_DEVICE_ENDIAN_LITTLE = TRUE
      CL_DEVICE_AVAILABLE = TRUE
      CL_DEVICE_COMPILER_AVAILABLE = TRUE

        • clGetDeviceInfo
          diepchess

          Please note that there is a small printing error in the first field (the label) of the output, though not in the query passed to the function clGetDeviceInfo (verified).

          The second line labelled 'CL_DEVICE_EXECUTION_CAPABILITIES'

          should read CL_DEVICE_QUEUE_PROPERTIES.

           

          Apologies for possible confusion there in question 5.

            • clGetDeviceInfo
              himanshu.gautam

              2. Currently the whole of the GPU memory is not available for GPGPU usage. This has been a request for some time now. You can see updates here:

              http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=149197&forumid=9

              3. It is a simple formula. Earlier each compute unit had 80 stream processors (16*5); now each has 64 stream processors (16*4).

              So 24 * 64 = 1536.

              5. Execution capabilities are shown appropriately. They are not supposed to show whether an out-of-order queue is supported or not. Moreover, out-of-order execution is not supported on AMD GPUs.
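
              The formula in point 3 can be written down as a small sketch. The per-compute-unit figures are the ones given above; the lookup table, its keys and the function name are hypothetical helpers for illustration, not something an OpenCL query returns.

```python
# Stream processors (PEs) per compute unit, as stated in point 3:
# 16 * 5 on the earlier parts, 16 * 4 on the current one (reported as "Cayman").
# The dictionary keys are labels chosen for this sketch only.
PES_PER_COMPUTE_UNIT = {
    "earlier": 16 * 5,
    "Cayman": 16 * 4,
}

def total_pes(compute_units, generation):
    """Multiply CL_DEVICE_MAX_COMPUTE_UNITS by the PEs per compute unit."""
    return compute_units * PES_PER_COMPUTE_UNIT[generation]

# CL_DEVICE_MAX_COMPUTE_UNITS = 24 in the dump for the Cayman device.
print(total_pes(24, "Cayman"))
```

              The table still has to be maintained by hand for each GPU generation, since the per-compute-unit number is not part of the standard device queries shown in the dump.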

               

               

                • clGetDeviceInfo
                  diepchess

                  Thanks for your very quick answer.

                   

                  2. The original request in that thread was to allow more than 128MB and move to the 25% described in the OpenCL manual. My request is to get rid of the 25% rule and make it a 100% rule.

                  3. What is needed in OpenCL in this case is functionality that shows

                      3a) how many PEs there are in each compute unit.

                      We don't want to hardcode this, of course. Consider how many scientists use open software like the code I am writing right now for this GPU. If all of them must change hardcoded defines in the future, that's not going to work. Is it possible to file a request to add this in the next release?

                      There are a few more, but let's skip those until this gets addressed. For example, you also want to know how many PEs form the vector as described (the answer right now is 4 for the 6000 series, yet OpenCL should have functionality to give that number '4' as an answer, rather than me hardcoding it). That is less relevant for now. What is relevant is 3a: for future and other GPUs we simply want to calculate the number of PEs automatically.

                   

                  5. Indeed, the out-of-order status of the GPU is displayed correctly.

                  The out-of-order status of the CPU is displayed WRONGLY. AMD Opteron CPUs are all out-of-order CPUs, yet the CPU doesn't carry the out-of-order flag right now.

                   

                  Thanks in advance,

                  Vincent

                   


                    • clGetDeviceInfo
                      nou

                      You can increase this 25% limit with export GPU_MAX_ALLOC_PERCENT=100, but it is unofficial and unsupported.

                      Use preferred vector width * preferred work group size * compute cores.

                      No, AMD OpenCL doesn't support out-of-order execution. Read the OpenCL specification for what is meant by out-of-order execution; it is not out-of-order instruction execution.

                      Local memory is not mapped to the L1 cache. In fact it is emulated in RAM. For more, read the OpenCL Programming Guide from AMD.

                      IMHO lspci -v reports the mapped memory region used to communicate with your card. It doesn't show how much memory your card has.
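
                      For anyone launching an OpenCL program from a script rather than a shell, the export above could be applied like this. It is only a sketch: GPU_MAX_ALLOC_PERCENT is the unofficial, unsupported variable named above, and the child command here is a stand-in that just prints what it inherited; replace it with your own program.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable named in the reply above.
# It is described there as unofficial and unsupported, so treat it as an
# experiment rather than a documented setting.
env = dict(os.environ)
env["GPU_MAX_ALLOC_PERCENT"] = "100"

# Stand-in child process: it only prints the variable it inherited.
# A real OpenCL application would be launched here instead (not shown).
child = [sys.executable, "-c",
         "import os; print(os.environ['GPU_MAX_ALLOC_PERCENT'])"]
result = subprocess.run(child, env=env, capture_output=True, text=True, check=True)

# The child sees the setting without the parent's own environment being changed.
print(result.stdout.strip())
```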

                        • clGetDeviceInfo
                          diepchess

                           


                           

                          Thanks for your very useful reply on a few of the points I posted. Surprisingly, I had missed the OpenCL programming guide. I have now downloaded the latest version of it. So far I had made do with just the OpenCL 1.1 specs. Seems silly, huh?

                          I do see a clGetKernelWorkGroupInfo function call, yet it seems you first need a kernel for it. So this isn't information about the kernel you just compiled, but a system-wide setting?

                           

                          So that gives 64 on a 6000 series card and 80 on the 4000/5000 series?

                           

                          We'll check it soon for the 4000+ series as well.

                           

                          As for out of order, I'll study later whether my question is relevant. Considering your answer I must withdraw it, as I interpreted 'out of order' as in 'OoO' processing as opposed to RISC.

                          I haven't checked the lspci source code yet, and I'm not sure it is relevant. What is relevant is that I want a way to see how much RAM the GPU has. Right now it lists 1GB, which seems a bit little to me, as the box I have here was sold to me as 2GB, which was a serious reason to buy that GPU, and that's not a joke.

                          As for exports or other unsupported ways to get things done: obviously I want to use the RAM that the device has. 2GB is already very little for specific tasks, considering there are 1536 PEs, or better, 24 compute units.

                          It's really a serious issue if you can officially address only 25% of the RAM, whereas fixing it is just a matter of raising a number in a manual.

                          I cannot remember, for example, when I last used 256MB of RAM or less for my chess program. That must have been in 2000 or so, when I had a P3. The dual K7 I had after that had 512MB, which was already quite little compared to other folks, so I used 400MB of RAM out of that.

                          On the 1024-processor Origin, where I had a 512-processor partition, I was more careful. Out of the 512GB I had, I used around 250GB or so. That was in 2003.

                          The Itanium 2 box allowed more, yet I didn't officially run on it, only did some testing.

                          Moving back to 256MB is psychologically going to hurt...
                          Regards,
                          Vincent

                           



                            • clGetDeviceInfo
                              himanshu.gautam

                              diepchess,

                              clGetKernelWorkGroupInfo does not return 64/80 for the specified devices. CL_KERNEL_WORK_GROUP_SIZE returns the workgroup size (probably 64, 128 or 256). This is the number of threads the workgroup can support for this particular kernel. Larger kernels that require more GPR and LDS resources are allowed only a small workgroup size, while smaller kernels can handle up to 256 work-items.

                              http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetKernelWorkGroupInfo.html

                              Well yes, out-of-order execution is not supported, although you might achieve some overlap by using separate queues and using events appropriately. Although nou's trick might help you use more than 256MB of memory, it is not officially supported. Try to divide your problem into parts so that 256MB of RAM is sufficient at any one time.

                              And we are working towards making more RAM available to users.
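
                              A minimal sketch of the "divide your problem into parts" advice, assuming the data can be processed in independent byte ranges. The function name and the 400MB example size are made up for illustration; the limit is the CL_DEVICE_MAX_MEM_ALLOC_SIZE value reported for the GPU earlier in the thread.

```python
# CL_DEVICE_MAX_MEM_ALLOC_SIZE as reported for the GPU (256 MiB).
MAX_ALLOC_BYTES = 268435456

def split_into_chunks(total_bytes, max_chunk_bytes=MAX_ALLOC_BYTES):
    """Return (offset, size) pairs covering total_bytes, each size <= max_chunk_bytes."""
    if total_bytes < 0 or max_chunk_bytes <= 0:
        raise ValueError("sizes must be non-negative and the limit positive")
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(max_chunk_bytes, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks

# Hypothetical 400 MiB data set that does not fit into a single buffer.
data_bytes = 400 * 1024 * 1024
for offset, size in split_into_chunks(data_bytes):
    print(f"offset={offset} size={size}")
```

                              Each pair could then back one buffer of its own, uploaded and processed in turn. Whether that works depends on the algorithm: data that needs random access across the whole range does not split this easily.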