
apaschos
Adept I

Cuckoo hashing in OpenCL

Hi all,

I'm trying to implement parallel cuckoo hashing in OpenCL (GPU only), based on the dissertation found here: Dan Alcantara's homepage >> Real-time Parallel Hashing on the GPU

Since it is part of a bigger project, I decided to create a standalone application. The thing is, I discovered additional errors in the standalone, so I'm going to have to post those first.

So the problem is in this piece of code:

prefixsum(BCount, BStart, totalbucks); // do the prefix sum to get the starting index for each bucket
barrier(CLK_GLOBAL_MEM_FENCE);
if(gid < SIZE) {
        // Store the key and its index in the new global position
        offsets[gid].x = offset;
        offsets[gid].y = bucket_id;
        offset += BStart[bucket_id]; // BStart gives back wrong values, although bucket_id is correct
        Shuffled[offset].x = key;
        Shuffled[offset].y = gid;
}

For some reason, BStart contains many zeros (it shouldn't), although the prefix sum of BCount and bucket_id are correct (I have exported them to the CPU, written them to a file and checked them).

Attached is the code for both host and device.

I'm working on Xubuntu 12.04 64-bit, kernel 3.2.0-48-generic

GPU: AMD Radeon HD 7700 Series GHz Edition

Driver Version: Catalyst 13.4

I've tried on Windows 8 64-bit with the same GPU and the latest drivers (downloaded about a week ago) and encountered the same problems.

Thanks in advance,

Andrew

Edit: Forgot to attach the code!

gbilotta
Adept III

From a cursory look at the code, it seems that you are trying to do kernel-wide synchronization, using CLK_GLOBAL_MEM_FENCE barriers as if they were global (cross-workgroup) synchronization points. They aren't. In fact, you cannot do global synchronization inside a kernel in OpenCL (or any other GPGPU programming model): the only sync points you can have are at the workgroup level (you can't sync one workgroup with another because they might not be running at the same time). If you need global syncs, you will have to split your kernel and run each part between two global syncs as a separate kernel.

Depending on what you need the global syncs for, you can sometimes use global atomics to do some coordination, but not in the way you seem to be trying to do it in your implementation.
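
A minimal sketch of that idea, with hypothetical kernel names (not the ones in your attachment): finish the prefix sum in one kernel and do the scatter that reads BStart in a second kernel, so the boundary between the two enqueues acts as the global synchronization point.

// Kernel 1: produce BStart. Nothing after this point may assume other
// workgroups have finished, so the kernel simply ends here.
__kernel void build_prefix_sum(__global uint *BCount,
                               __global uint *BStart,
                               uint totalbucks)
{
    // ... count keys per bucket and prefix-sum the counts into BStart ...
}

// Kernel 2: only launched after kernel 1 has completed, so every
// work-item sees the finished BStart.
__kernel void scatter_keys(__global const uint *BStart,
                           __global uint2 *offsets,
                           __global uint2 *Shuffled)
{
    // ... compute the destination from BStart[bucket_id] and write Shuffled ...
}

On the host, two clEnqueueNDRangeKernel calls on an in-order command queue (or chained with an event) guarantee that the first kernel has fully completed before the second one starts.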

himanshu_gautam
Grandmaster

It looks the same to me too: you seem to be trying to achieve global synchronization within a kernel, which is not possible.

apaschos
Adept I

OK, I hadn't realized that. Thanks for the fast reply.

In fact, in my application I had already broken the first part into separate kernels, and it was working.

So now for the actual problem. In the cuckoo hashing (last kernel), each workgroup works on its own.

It seems as if the CLK_LOCAL_MEM_FENCE barrier does not synchronize all the threads in the workgroup.

hltables[cur_ofst].x = value_in; // Insert your value and hope no one else overwrites it!
barrier(CLK_LOCAL_MEM_FENCE); // Synchronize all the workgroup threads so that the following read makes sense
value_out = hltables[cur_ofst].x; // Now check if your value was actually inserted
if(value_in == value_out) { // If it was inserted, write the index too. Then loop aimlessly...
    hltables[cur_ofst].y = key_index;
    hashed = true;
}

When I later check hltables[cur_ofst].x, it is not the same as value_in, which probably means that someone overwrote its value.

I thought that barrier(CLK_LOCAL_MEM_FENCE) ensures memory ordering inside one workgroup.

PS. I attached the updated code in the original post as cuckoo2.tar.gz (seems like I can't attach code on a reply)

PS2. I don't know if I should mark your replies as answers, since there is a second part to the question. In addition, I can't find any link to forum rules regarding answers.

Thanks in advance,

Andrew


Looks like your sample needs Boost to compile. Are there any more surprises in case I try to compile it?

The code looks complex anyway, as I have no idea what cuckoo hashing is. Maybe you can create a smaller testcase which is easy for other developers to compile and run.


You can attach code anywhere; you should use the advanced editor for that. Marking any reply as answered is entirely at your discretion.



himanshu.gautam wrote:

Looks like your sample needs Boost to compile. Are there any more surprises in case I try to compile it? The code looks complex anyway, as I have no idea what cuckoo hashing is. Maybe you can create a smaller testcase which is easy for other developers to compile and run.

I decided to use Boost (1.46.1) because it's cross-platform and makes my life easy with strings and random numbers. It would be difficult to replace it with something simpler that is still cross-platform.

There won't be any more surprises I believe.

The code is pretty much the small testcase already, so I'll try to explain it a bit.

What we want

We have an array of (key, value) pairs and we want to store them in a way that makes it fast to retrieve a specific pair.

In this piece of code we focus on building the hash table, not on retrieval.

What we do

We use a two-level hashing scheme, with the first level shuffling the input pairs and the second implementing the cuckoo hashing. The first three kernels are the first part; the last kernel is the second part.

In the first part, all threads work together, so if one fails, all of them restart. In the second part, each workgroup takes one part of the shuffled data and works independently, so if one thread fails, all threads in that workgroup restart (more on this later...).

Right now, we will ignore the first part and focus on the cuckoo hashing.

Theory

Cuckoo hashing is a dynamic hashing procedure, which means that the position of a key in the hash table is not given by a single deterministic function but is chosen probabilistically.

The whole hash table is broken into a number of subtables, 3 in our case (SUBTABLES in my code).

In the serial version of the algorithm, each key draws a random number and, based on that, tries to write its value into the first subtable. If another key has already written its value at that position, it draws another random number and tries again in the next subtable.

The procedure continues until all pairs have been written to an empty location.

If a pair hasn't managed to get into any subtable, the hash table is destroyed and the procedure restarts with different seeds for the random numbers. Hopefully, after a number of attempts, the table will be built...

Note that the hash table is bigger than the input, in order to minimize conflicts. For example, if we have 100 pairs, the hash table will have room for 100(1+gamma) pairs, where 0<gamma<1. That means the table will have some empty slots in the end.

In order to retrieve a pair later, we also need to store the random numbers used when building the table.
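
A rough serial sketch of the placement step just described (illustrative only; the constants and the helper name are made up, not taken from my attached code):

#define SUBTABLES 3
#define PRIME     1900813u      /* a large prime used by the hash (illustrative) */
#define EMPTY     0xFFFFFFFFu

/* Try to place one key: walk the subtables in order and take the first free
 * slot. Returns the slot index, or -1 if every candidate position is taken
 * (the whole table must then be rebuilt with fresh seeds a[], b[]). */
int place_key(unsigned *table, unsigned subtable_size,
              const unsigned a[SUBTABLES], const unsigned b[SUBTABLES],
              unsigned key)
{
    for (int t = 0; t < SUBTABLES; ++t) {
        unsigned slot = t * subtable_size
                      + ((a[t] + b[t] * key) % PRIME) % subtable_size;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return (int)slot;
        }
    }
    return -1;
}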

GPU Cuckoo

Now, in the parallel GPU version, the cuckoo hashing is performed inside a workgroup. Each workgroup initializes a hash table (hltables) in local memory with (key, value) = (MAX_UINT, MAX_UINT), where MAX_UINT = 0xffffffff.
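
That initialization amounts to something like this (a sketch, assuming hltables is a __local uint2 array of SUBTABLES * subtable_size entries):

// Each work-item clears a strided share of the local table, then the
// whole workgroup synchronizes before anyone starts inserting.
for(uint i = get_local_id(0); i < SUBTABLES * subtable_size; i += get_local_size(0))
    hltables[i] = (uint2)(0xffffffff, 0xffffffff);
barrier(CLK_LOCAL_MEM_FENCE);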

Instead of having 3 separate subtables, I keep a single array (hltables) and move to the right index by calculating the offset of each subtable:

cur_ofst = tries * subtable_size; // The index at each subtable

Then I calculate the index inside the subtable based on the function given in the algorithm and add it to cur_ofst:

bucket_id = ((newRandoms[tries].x + newRandoms[tries].y * value_in) % PRIME) % subtable_size;

cur_ofst += bucket_id; // Update the offset with the index in the subtable
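
For example (made-up numbers): with subtable_size = 192, tries = 1 and a hash that lands in bucket 57, the final index is cur_ofst = 1*192 + 57 = 249 into the single hltables array.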

After that, each thread writes its value to the subtable, waits at the local barrier and then checks whether its value managed to stay in the table. If not, it moves on to the next subtable:

hltables[cur_ofst].x = value_in; // Insert your value and hope no one else overwrites it!
barrier(CLK_LOCAL_MEM_FENCE); // Synchronize all the workgroup threads so that the following read makes sense
value_out = hltables[cur_ofst].x; // Now check if your value was actually inserted

If all goes well, every thread has inserted its pair into one of the subtables. If not, the threads that failed signal the rest through the variable alert, and the cuckoo hashing restarts:

for(attempts=0; attempts<MAX_ATTEMPTS; attempts++) {
    ...
    barrier(CLK_LOCAL_MEM_FENCE);
    if(!alert) break; // if nobody has alerted failure, break and save the hashtable
}

In the end, each workgroup copies the hash table it built from local memory to global memory and saves the seeds of the random numbers that built the table.

Problem

I write the cuckoo hash table to a file for checking.

This table should contain unique values (except for those that are still 0xffffffff) in the range [0, num_uniqs-1], but it doesn't. It looks like some of them are written in contiguous positions and some are overwritten.

References

On the page I link to in my first post, there is a dissertation and a paper.

In the dissertation, there is a CUDA implementation on pages 68-69.

In the paper, there is a description of the algorithm on page 4, under the paragraph Phase 2.

Thank you for the interest,

Andrew Paschos

EDIT

Removed Boost completely.

Compiled on Linux and Windows (Visual Studio 2010).

Attached the updated code.

I found what I was doing wrong.

I was writing:

if(something) {
  ...
  barrier(CLK_LOCAL_MEM_FENCE);
}
else {
  barrier(CLK_LOCAL_MEM_FENCE);
}

which is wrong!

After reading the specification again, I saw this:

"If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier."

Attaching the new code.

Tested only on Linux, but it should work on Windows too.


Good to know you found the issue yourself. Thanks for sharing your experience.


Hello again,

Although the algorithm seemed to work correctly, I recently found out that this is not always the case.

The algorithm is written to work for arbitrary input sizes. It does work for rather small inputs, up to 2048 elements, but for bigger inputs it does not always work (sometimes it does, sometimes it doesn't).

After looking thoroughly over the code, I have narrowed the source of the problem down to the same part as the first problem, in the second step.

Which means I still don't understand some synchronization concepts, especially concerning local memory.

Which brings me to the next question:

In each work group, I use only some of the available work items, say 150 out of 256, by writing:

     lid = get_local_id(0);
     if(lid < size) {...} // size in this case is 150

(The mistake I made earlier was putting the barriers inside the if statement...)

I have a for loop in which I initialize some local variables at the beginning, perform a local memory barrier after that and then do the actual work.

for(attempts=0; attempts<100; attempts++) // all threads in a work group execute the same number of iterations
{
    // initialize local variables
    barrier(CLK_LOCAL_MEM_FENCE); // <- is this needed? **

    // copy from global memory
    async_work_group_copy();
    wait_group_events();

    for(i=0; i<3; i++) // all threads in a work group execute all of the 3 iterations
    {
        if(lid < size)
        {
            // do stuff in local memory
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // do more stuff in local memory
    barrier(CLK_LOCAL_MEM_FENCE);
}

Is there any synchronization problem in this logic? I repeat: this algorithm sometimes works and sometimes doesn't. When it doesn't, it fails only for some of the work groups (the rest work just fine).

** Side question: at the beginning of the first loop, I use async_work_group_copy() and wait_group_events(). Before this copy, I initialize some (other) local variables. Is it necessary to put a barrier there, or is wait_group_events() enough?
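
For reference, the canonical shape of this copy is the following (a generic sketch with placeholder names, not my actual kernel): every work-item in the group calls async_work_group_copy with identical arguments, and the copied data is only guaranteed to be complete after wait_group_events returns.

// Generic sketch: copy a chunk of global memory into local memory.
event_t ev = async_work_group_copy(local_buf,                 // __local destination
                                   global_src + group_offset, // __global source
                                   n_elems, 0);
wait_group_events(1, &ev);
// From here on, local_buf is valid for every work-item in this workgroup.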


Hi,

Since you are initializing the local variable, the barrier is not required there.

Also, in the loop you are doing something with local memory, right? I don't know exactly what you are doing, but I can guess: if you are reusing/updating the values of a local variable, then all the threads which are < size try to do so at the same time. This is one thing you need to take care of.

One more thing you can try: instead of CLK_LOCAL_MEM_FENCE, use CLK_GLOBAL_MEM_FENCE and check the results.




himanshu.gautam wrote:

Since you are initializing the local variable, the barrier is not required there. Also, in the loop you are doing something with local memory, right? I don't know exactly what you are doing, but I can guess: if you are reusing/updating the values of a local variable, then all the threads which are < size try to do so at the same time. This is one thing you need to take care of. One more thing you can try: instead of CLK_LOCAL_MEM_FENCE, use CLK_GLOBAL_MEM_FENCE and check the results.



Hello,

I'll try to make it clearer. I use a local variable alert that is (re)set to 0 when the outer loop starts.

If a thread fails, it sets alert to 1. I'm reposting the previous code with the extra parts added.

for(attempts=0; attempts<100; attempts++) // all threads in a work group execute the same number of iterations
{
    // initialize local variables
    if(lid == 0) alert = 0;
    barrier(CLK_LOCAL_MEM_FENCE); // <- is this needed? **

    // copy from global memory
    async_work_group_copy();
    wait_group_events();

    for(i=0; i<3; i++) // all threads in a work group execute all of the 3 iterations
    {
        if(lid < size)
        {
            // do stuff in local memory
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // do more stuff in local memory
    // if a thread failed to do its work, it sets alert to 1, so that the rest of the threads find out
    if(...) alert = 1;
    barrier(CLK_LOCAL_MEM_FENCE);

    // if nobody has set alert to 1, it means everyone succeeded, so the loop breaks and the kernel finishes
    if(alert == 0) break;
}

The only explanation I can give from the output is that the threads fail to synchronize on the variable alert, so some of them continue the outer loop and others don't.

Why should I try CLK_GLOBAL_MEM_FENCE? The kernel uses local memory; only at the end does each work group write its final results to a specific place in global memory (there are no conflicts between work groups).

    

Thank you for your time,

Andrew


Hello again,

Turns out the problem is on your side.

After looking at this thread http://devgurus.amd.com/message/1298947#1298947, I tried without #pragma unroll and everything seems to work now.

The funny thing is the NVIDIA driver has the same problem...

I'm attaching cleaner code (with comments on the #pragma unroll) in case you want to check what's going on.
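
The workaround boils down to this (a sketch with a placeholder loop; the real kernel is in the attachment):

// #pragma unroll   // <- with this pragma the kernel misbehaves; commenting it out fixes the results
for(i=0; i<SUBTABLES; i++)
{
    // ... cuckoo insertion attempt for subtable i ...
}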

Thanks for the support and a big thanks to mulisak,

Andrew