2 Replies Latest reply on Mar 17, 2012 3:54 PM by rj.marques

    Local memory and work-groups




      Say I want to allocate X bytes of local memory, and clGetDeviceInfo reports that the maximum local memory size is Y.


      I also have a one-dimensional global work size of T and a one-dimensional local work size of L, so I have W = T/L work-groups.


      How do I calculate the effective quantity of local memory? Is it just X? Or is it X*W?


      I have an AMD HD4850, and I have implemented the following example:


      - Number of work-items (1D): 384 000

      - Number of work-items per work-group (1D): 256 (the maximum for my GPU)

      - Local memory: 40 bytes

      - Maximum local memory: 16384 bytes


      In this scenario, clEnqueueNDRangeKernel returns the error CL_INVALID_WORK_GROUP_SIZE. The interesting thing is that if the number of work-items per work-group is set to 64, it works fine.


      What am I missing?

      Thanks for your replies


      Edit: BTW, CPU execution works fine in both of the above scenarios. Is it a bug?

        • Re: Local memory and work-groups

          Looks like 4xxx hardware is a bit limited. Shoehorning OpenCL onto it was a stretch, particularly the local memory support (which probably means you're using barriers?). See: http://devgurus.amd.com/thread/124649


          BTW, that was the first hit from a Google search for "hd4850 local work size", so a little digging would've saved you some time.
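
          For what it's worth, here is a rough sketch in plain C of the rule that produces CL_INVALID_WORK_GROUP_SIZE. It is not the driver's actual logic. On a real device the per-kernel limit comes from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE, and it can be lower than the device-wide CL_DEVICE_MAX_WORK_GROUP_SIZE. The names check_launch, LAUNCH_OK and LAUNCH_BAD_GROUP_SIZE are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical result codes, standing in for CL_SUCCESS and
   CL_INVALID_WORK_GROUP_SIZE. */
enum launch_check { LAUNCH_OK = 0, LAUNCH_BAD_GROUP_SIZE = 1 };

/* Checks a 1D launch against the two work-group size rules of OpenCL 1.x:
   - the local size must not exceed the limit for this kernel, and
   - the global size must be a multiple of the local size.
   kernel_max_group_size is an input here; in a real program it would be
   queried with CL_KERNEL_WORK_GROUP_SIZE. */
enum launch_check check_launch(size_t global_size,
                               size_t local_size,
                               size_t kernel_max_group_size)
{
    if (local_size == 0 || local_size > kernel_max_group_size)
        return LAUNCH_BAD_GROUP_SIZE;
    if (global_size % local_size != 0)
        return LAUNCH_BAD_GROUP_SIZE;
    return LAUNCH_OK;
}
```

          Assuming the kernel limit on your card is 64 (a guess based on what you observed, not a confirmed figure), check_launch(384000, 256, 64) is rejected while check_launch(384000, 64, 64) passes, which matches the behaviour you described.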


          If you've set the local memory to X bytes, it will only consume X bytes per work-group, not X*W. Local memory sizes must be set explicitly (either hardcoded or calculated). The hardware might be capable of executing N work-groups per processor at once, in which case N*X bytes of physical local memory are used, but you don't have direct control over that.
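
          The arithmetic, using the numbers from the question, can be sketched in plain C as follows. This is only an illustration. The function names are invented, and resident_groups (the N above) is an assumed input, since the hardware and driver decide it.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups for a 1D launch: W = T / L.
   OpenCL 1.x requires T to be a multiple of L. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* The check that matters: the X bytes requested per work-group must fit
   in the device maximum Y (CL_DEVICE_LOCAL_MEM_SIZE). W plays no part. */
int local_mem_fits(size_t bytes_per_group, size_t device_max_bytes)
{
    return bytes_per_group <= device_max_bytes;
}

/* Physical local memory in use on one processor when resident_groups
   work-groups execute on it concurrently: N * X. */
size_t local_mem_in_use(size_t bytes_per_group, size_t resident_groups)
{
    return bytes_per_group * resident_groups;
}
```

          With T = 384000 and L = 256, work_group_count gives W = 1500, and local_mem_fits(40, 16384) holds regardless of W. If, say, 4 groups happened to be resident on one processor, local_mem_in_use(40, 4) would be 160 bytes.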