4 Replies Latest reply on Aug 4, 2010 11:21 AM by mihzaha

    good LRU algorithm?

      do you know a good least recently used algorithm for the GPU?

      Hi again

      I have a large list of voxels. When there is no more space for new voxels, the old ones have to be deleted. I was thinking of something like this:

      1 global timer that increases every frame.

      when inserting the voxel, set the time of the voxel to the global timer

      every time a voxel is read, update the timer of the respective voxel

      when there is no more space left or the global timer reaches the end, search for the (smallest values*) and free them and reset to 0 the other values.


      (*)the global times is something small, like char. When searching, create a list of 255 values, in which we count the number of voxels which have the respective timer count (on index 0 the number of voxels which have the the timer set to 0; index 1 number of voxels with timer set to 1 ...). Count on this list which is the threshold value of the timer, so to free enough memory and do the search again and free voxels which timer values below threshold.


      O(1) in general, but once in a while O(n), and it is a big N (a voxel has about 20 bytes, and I have to keep the memory almost full).


      Do you know of any better solution?


      Thank you

        • good LRU algorithm?

          Well, if your voxel ages really are restricted to 0-255 you could keep them in bins of linked lists instead of storing the actual age. That way no searching would be required, you could just empty the first N bins until you're happy with the amount of free space. Of course, this will add some code for maintaining the lists, but I suspect this will be pretty cheap compared to cache misses across the board during your linear search.

          One advantage of freeing up whole bins is that you don't actually have to traverse all the voxels during the free-phase. You can just append entire bin-lists to your free-list.

            • good LRU algorithm?

              I don't know if I understood:

              let's say I have 3 lists

              1: v1, v5, v3, v7, v3, v3

              2: v5, v3, v1, v15

              3: v1, v3, v3, v1

              if I erase 1, I also erase voxels 1 and 3 which are new. If I search every time the lists to remove the voxels from older lists when they are inserted into newer, I think it's to time consuming.

              I don't know how good lists are because they also make things sequential (I'm new thow, so I may be wrong, maybe with enough threads, changing threads solves this problem): all threads come to the list and say "add voxel to list" and they have to wait, because only one item can be appended a a time.

              Another problem is that I'm using an octree, the root will be appended by every thread, because all threads ask the root for it's children; and then the children many times...

              Did I understand wrong?

                • good LRU algorithm?

                  I suppose I may have misunderstood how your datastructure worked. What I was thinking about was maintaining the age state in the list structure like this:

                  a) One list for each age

                  b) When you touch a voxel during computation in such a way that its age should increment, move it to the list corresponding to next age

                  Lists don't have to make things sequential. Local locks can be used to enable parallel insert and delete operations (as long as the neighboring list elements aren't used at the same time - Possibly like the situation you mentioned, where every thread could want to insert at age 1. However, using octtrees should not create contention issues, because there's no need to update the age of an item with age 1.). I don't see any reason why you would ever traverse any of these lists sequentially either, if you keep the link indices in the voxel structure.(Except possibly as a way of fetching new work units. It wasn't clear how you do that today. I suppose you either have to copy voxel data into a linear work array using a sequential algorithm or deal with contention issues. Neither are free.) Their main function would be to group data items for quick freeing anyway.

                  Another way to increase parallelism(If access to neighboring list elements is a performance killer, which it may possibly be in this case ) could be to use several sets of N bins. Taking it to the extreme and having one set for each possible local work item should even remove the need for locking during inserts and deletes, but might create a need for some kind of load balancing mechanism if the non-deleted lists were to serve as a source of work items in the next pass.(Again, it's unclear how you distribute work at the moment)

                  what I had in mind was something along the lines of the following:


                  t=0: (Process v1,v5,v3 and v8)

                  [Insert v1,v5,v3 and v8 in age list 1]

                  1: v1, v5, v3, v8



                  (Age lists by appending list N to N+1 starting at the oldest)

                  2: v1, v5, v3, v8


                  t=1: (Process v1, v5 and v15)

                  [Move v1 and v5 from age list 2 to age list 1]

                  [Insert v15 in age list 1]

                  1: v1,v5,v15

                  2: v3,v8



                  (Age lists by appending list N to N+1 starting at the oldest)

                  2: v1, v5,v15

                  3: v3, v8



                  t=2: (Process v1 and v8)

                  [Move v1 to age list 1]

                  [Move v8 to age list 1]

                  1: v1, v8

                  2: v5,v15

                  3: v3


                  (Free age>1)

                  [Append list 2 and 3 to free list]

                  1: v1, v8



                    • good LRU algorithm?

                      Yes it looks good, and the part of increasing the timer can be done while the CPU is preparing the next voxels.

                      But still, I don't know how to make it parallel:

                      every ray will want to add the voxel it encounters at the list corresponding to time 1, but an element can't be added until the last element is completed. It's crucial that the add process be really fast, because every ray hits hundreds of voxels every frame (it doesn't stop at the first encounter).

                      And yes, I'll have to copy the last list into a vector sequentially to tell the CPU which voxels are free to update, but if I didn't use lists I would have had to traverse all the voxels and copy the old ones into the list, which costs more but can be parallelized because the voxels are in a vector.