
t-man · ‎10-04-2012

Well the problem I have is the following:

I have a kernel that tries to calculate the betweenness centrality of a graph in parallel. Something very strange happens: a loop is executed twice by one of the work-groups. On average, once every 7-8 runs the second while loop ( while(count<nr_roots) ) is executed twice by a work-group, although count is incremented in the first iteration. In my case I have a 12-vertex graph, nr_roots is 1 and count is 0; the while body executes, count is incremented by 1, and yet count is seen as 0 one more time.

This happens only about once every 6-7 runs, remember, not always. Does anyone have any idea why? I also tried making count a __local variable (shared by the work-group) and a __private variable (visible to a single work-item only), with no success. Any tips or suggestions are more than welcome!

" while ( found_local != 0){ \n" \

" \n" \

" \n " \

" if(i==0) { level_local = atomic_add(nr_level,0); atomic_xchg(found,0);\n" \

" pozition_local = atomic_add(pozition,0);\n " \

" nr_roots = atomic_add(&level[level_local],0)/j; atomic_xchg(&count,0); nr=0; rest = atomic_add(&level[level_local],0)%j; \n" \

" if(k<rest) nr_roots = nr_roots + 1;} \n" \

" \n" \

" barrier(CLK_GLOBAL_MEM_FENCE); \n " \

" \n" \

" while(count < nr_roots ){ \n" \

" \n" \

" if(i==0){ \n" \

" root = stack[pozition_local + count*j + k];\n" \

" succ_index[root] = 0; \n" \

" nr_neigh = firstnbr[root+1] - firstnbr[root]; } \n" \

" barrier(CLK_LOCAL_MEM_FENCE);\n" \

" \n" \

" neigh_per_thread = nr_neigh/size; \n" \

" if(i<nr_neigh%size) \n" \

" neigh_per_thread ++; \n" \

" h = 0; \n" \

" while(h<neigh_per_thread)\n" \

" {\n" \

" node = nbr[firstnbr[root] + size*h + i];\n" \

" \n" \

" dw = atomic_cmpxchg(&d[node], -1, level_local + 1);\n" \

" \n" \

" if(dw == -1)\n" \

" {\n" \

" atomic_inc(&level[level_local + 1]);\n" \

" atomic_cmpxchg(found,0,1);\n" \

" dw = level_local + 1;\n" \

" gh = atomic_inc(nr_stack);\n" \

" stack[gh] = node;\n" \

" \n" \

" }\n" \

"if(dw == level_local + 1)\n" \

" { \n" \

" \n" \

" temp = atomic_inc(&succ_index[root]);\n" \

" succ[firstnbr[root] + temp] = node;\n" \

" GetSemaphor2(&sem[0]); temporal = atomic_xchg(&sigma[node],0); temporal2=atomic_xchg(&sigma[root],sigma[root]); \n" \

" atomic_xchg(&sigma[node],temporal+temporal2);ReleaseSemaphor2(&sem[0]); \n" \

" } \n" \

"h++; \n" \

"} \n" \

" \n" \

"if(glob%6==1) {atomic_add(&count,1);if(root==4&&nr1==1) BC[8] = 1;} \n" \

" barrier(CLK_GLOBAL_MEM_FENCE); } \n" \

" \n" \

" barrier(CLK_LOCAL_MEM_FENCE);\n"

"if(glob==0) {f= atomic_add(&level[level_local],0); atomic_add(pozition,f); atomic_add(nr_level,1); \n" \

" } \n" \

" \n" \

" if(i==0) \n" \

" { atomic_add(global_sync,1); \n" \

" if ( k==0) { while(atomic_add(global_sync,0)< j); atomic_xchg(global_sync, 0); } \n" \

" else { while(atomic_add(global_sync,0) > 0); }} \n" \

"barrier(CLK_LOCAL_MEM_FENCE);if(i==0) found_local = atomic_add(found,0);barrier(CLK_LOCAL_MEM_FENCE);\n" \

"} if(glob==11) BC[glob] = atomic_xchg(&sigma[11],sigma[11]); } \n";

t-man · ‎10-19-2012

I managed to localize the problem. One work-group was going through the iteration much faster than the others, incrementing the global variable "nr_level" and making the other work-groups see a wrong value. Thank you yurtesen for all your help!
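The race described in this post can be modelled without a GPU. The following is a minimal sketch in plain C, not the original kernel: `simulate_schedule` and its schedule string are hypothetical names, and the model assumes two work-groups where only group 0 advances the shared `nr_level`. It shows that the level a group reads depends purely on whether the other group's increment lands before or after the read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the race (not the original kernel).
 * schedule is a string of events executed in order:
 *   '0' : work-group 0 reads nr_level into its level_local
 *   '1' : work-group 1 reads nr_level into its level_local
 *   '+' : work-group 0 finishes the level and increments nr_level
 * seen[g] receives the level that group g read, or -1 if it never read. */
void simulate_schedule(const char *schedule, int seen[2])
{
    int nr_level = 0;                 /* shared counter, as in the kernel */
    assert(schedule != NULL);
    seen[0] = seen[1] = -1;
    for (const char *p = schedule; *p != '\0'; p++) {
        switch (*p) {
        case '0': seen[0] = nr_level; break;
        case '1': seen[1] = nr_level; break;
        case '+': nr_level++;         break;
        default:  break;              /* ignore unknown events */
        }
    }
}
```

With the schedule "01+" both groups read level 0; with "0+1" the slower group reads level 1 while the faster one worked on level 0. Nothing here says how the original code was eventually fixed; the sketch only illustrates why the shared counter must not advance until every group has taken its copy.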


yurtesen · ‎10-04-2012

You are increasing count inside the while loop, but I think that before that point many threads can get past the while(count < nr_roots ) line (so they can enter the while loop before one thread has a chance to increase count). When that happens, two or more threads can increase count even if it goes over nr_roots. Does that make sense?

t-man · ‎10-05-2012

Well, count is increased only by a single thread, "if(glob%6==1)", where glob is the global ID of the work-item, after which I place a barrier to be sure that all the work-items see that count has been increased before they start the new iteration.

The point is for each thread to go through each iteration once. As you can see, I have an "if(i==0) root =stack[]..", which means that only the lowest-numbered thread of the group should do that bit, and for some reason it gets done twice with count = 0. I tested it.

Is that logic correct, or what exactly do you mean? Thanks for your help!

yurtesen · ‎10-08-2012

Let's say the threads with "glob"al IDs 1 and 7 entered the while(count < nr_roots ) loop when count was zero. What would stop them from increasing the counter while they are in there? Am I misunderstanding something?

By the way, having kernels in a separate file makes reading code much easier.

t-man · ‎10-09-2012

What I should mention here is that "k" represents the group number, so count is individual per work-group. The two global threads 1 and 7 will therefore modify different positions in count, and count is modified by a single thread in each work-group.

I am sorry about the kernel, but when I tried to place it in a separate file the code stopped working for some reason.

I really appreciate any help you can give me at this point. Thank you!

yurtesen · ‎10-10-2012

How can you tell that global threads 1 and 7 will be in different workgroups (k)? If your workgroup size is 64, wouldn't threads 0 through 63 be in the same workgroup (k)?

t-man · ‎10-10-2012

I explained it poorly, I am sorry. This code is for the test I am running, where the workgroup size is 6. Of course 1 and 7 apply only to this case, but the general question stands:

With a workgroup of size 6, why would this behavior happen?

yurtesen · ‎10-10-2012

First of all, if the workgroup size was 6, then the first workgroup would not increase count at all, since its threads will have global IDs (glob) 0,1,2,3,4,5.

if(glob%6==1) {atomic_add(&count,1)

If your 'k' workgroup ID is correct and nr_roots is indeed 1, then from what I can see, what you are seeing shouldn't happen.

But I see that you are assigning values to nr_roots, so perhaps your problem is that it is not actually 1 in the cases where you expect it to be 1:

" nr_roots = atomic_add(&level[level_local],0)/j; atomic_xchg(&count,0); nr=0; rest = atomic_add(&level[level_local],0)%j; \n" \

" if(k<rest) nr_roots = nr_roots + 1;} \n" \

You are assuming nr_roots is 1, but I would use printf to check it (perhaps only printing when it is not 1). Can you do this?
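For readers following the nr_roots calculation quoted above, here is a small host-side sketch in plain C of the same arithmetic. The function names `roots_for_group` and `stack_slot` are hypothetical; the formulas mirror the kernel lines nr_roots = level/j, rest = level%j, and stack[pozition_local + count*j + k], under the assumption that j is the number of work-groups and k is the group ID.

```c
#include <assert.h>

/* Number of roots work-group k handles when level_size roots are split
 * across j work-groups: the first (level_size % j) groups take one extra. */
int roots_for_group(int level_size, int j, int k)
{
    assert(j > 0 && k >= 0 && k < j && level_size >= 0);
    int nr_roots = level_size / j;
    int rest = level_size % j;
    if (k < rest)
        nr_roots = nr_roots + 1;
    return nr_roots;
}

/* Stack slot read by group k on its count-th pass, mirroring
 * stack[pozition_local + count*j + k] in the kernel. */
int stack_slot(int pozition_local, int count, int j, int k)
{
    return pozition_local + count * j + k;
}
```

With two groups and a level of two vertices, each group gets exactly one root, so a correct run enters the inner while loop once per group for that level. With three vertices, group 0 gets two roots and group 1 gets one.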

t-man · ‎10-10-2012

I did this and it is 1. The problem is that count is 0, the group does the iteration, and then count is 0 again in the case where nr_roots is 1. I am trying to find out why one group would go through the iteration twice with the same count.

And the way nr_roots is calculated is fine, I checked it. The main problem is count.

k = get_group_id(0), this is k. No problem here, I checked that too, and for work-group 0 (threads 0 1 2 3 4 5) glob%6==1 selects glob 1.

I really appreciate all your help, thanks , maybe you have some other suggestions!
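The ID arithmetic discussed here can be checked on the host. This is a minimal sketch in plain C, assuming a one-dimensional range with no global offset, where OpenCL numbers work-items so that global ID = group ID * local size + local ID. The helper names are hypothetical.

```c
#include <assert.h>

/* Group and local IDs of a work-item in a 1-D range without offset. */
int group_of(int glob, int local_size)
{
    assert(local_size > 0 && glob >= 0);
    return glob / local_size;
}

int local_of(int glob, int local_size)
{
    assert(local_size > 0 && glob >= 0);
    return glob % local_size;
}

/* Counts the work-items of group k that satisfy glob % local_size == 1,
 * i.e. how many work-items of that group would execute the increment. */
int incrementers_in_group(int k, int local_size, int global_size)
{
    int n = 0;
    for (int glob = 0; glob < global_size; glob++)
        if (group_of(glob, local_size) == k && glob % local_size == 1)
            n++;
    return n;
}
```

For a local size of 6 and a global size of 12, exactly one work-item per group passes the test: global ID 1 in group 0 and global ID 7 in group 1. The condition therefore does select a single incrementing thread per work-group in this test case.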

yurtesen · ‎10-10-2012

Yes, you are right about group zero. I should go and rest, I guess.

There is something strange in your explanation: if count is zero, it can't just jump to 2. It will be the same thread that increases count within a workgroup (thread ID 1 in the first workgroup, 7 in the second, etc.). I think it should know whether it increased it or not.

What about printing it right after the atomic add? If what you say is true, you should see zero, right? Do you see zero?

if(glob%6==1) {atomic_add(&count,1);

I can run it and see what your program is doing if you can make a small test case. But I have no other ideas about what might be wrong at this point.

t-man · ‎10-10-2012

Thanks again for the swift reply. So what should I do regarding the test case? Should I give you the whole source? Could that prove helpful?

My email is tudor_uricec@yahoo.com ( skype is same tudor_uricec ). Thanks!

yurtesen · ‎10-10-2012

Normally a perfect test case is a program which is as simple as possible yet can still cause the problem, with some instructions on how you are compiling and running it, obviously. You can attach it to the thread so everybody who is interested can have a look.

t-man · ‎10-11-2012

compile: gcc -o hello_world -I /opt/AMDAPP/include/ -L /opt/AMDAPP/lib/x86/ -I /home/tudor/Desktop/examples/OpenCL_Hello_World_Example/ hello.c -lOpenCL -lm, where the -L option gives the OpenCL library path, the first -I is the OpenCL include path, and the second -I is the path to the defs.h header.

execute: ./hello_world -parallel -grid 3 4, which will create a 3 x 4 graph and execute on two workgroups of 6 threads each. Any help is more than appreciated! Thank you!

Also, the results are saved in BC, and you will notice that BC[11] should have the correct result of 10, but sometimes it has 14, since the second workgroup goes through the count while loop twice when it shouldn't.

If you take out the "if(glob==11) " statement, BC will be filled with all the values.

yurtesen · ‎10-11-2012

It is trying to load a "Wiki-Vote.txt" and then prints failed to load kernel.

Also, if you use:

#include "defs.h"

You can simply compile it...

$ gcc -o hello hello.c -lOpenCL

hello.c: In function ‘betwenessparallel’:

hello.c:588:17: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘size_t’ [-Wformat]

hello.c:756:17: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t’ [-Wformat]

$

This works on Ubuntu...
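The two -Wformat warnings above come from printing a size_t with %d. A minimal sketch of the matching conversion in C99 follows; `format_size` is a hypothetical helper used only to make the point testable, not something from hello.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes a size_t into buf using the matching %zu conversion (C99).
 * Returns the number of characters written, as snprintf does. */
int format_size(char *buf, size_t buflen, size_t value)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%zu", value);
}
```

In the host code the same change would mean replacing %d with %zu wherever a size_t such as a work-group size is printed, or casting the value to int explicitly if it is known to be small.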

yurtesen · ‎10-11-2012

On another note

clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(int),(void *) &l, NULL);

count = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(int)*l, NULL, NULL);

For a CPU, the max compute units is the core count... but you are using:

k = get_group_id(0);

The group ID can be larger than the core count, since the number of groups is global/local. For this test case it shouldn't be a problem, though, since you seem to have the following (and probably a dual-core or larger processor):

local = 6

global = local*2;

I really have to suggest cleaning up the code and moving those kernels out of there. It is not very easy to follow what is going on...
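The buffer-sizing concern in this post can be expressed as a small host-side check. This is a sketch in plain C with hypothetical helper names, assuming a one-dimensional launch whose global size is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups in a 1-D launch: global size divided by local size. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Returns 1 if a buffer with `slots` elements is large enough to be
 * indexed by every group ID of the launch, 0 otherwise. */
int per_group_buffer_fits(size_t slots, size_t global_size, size_t local_size)
{
    return num_work_groups(global_size, local_size) <= slots;
}
```

With local = 6 and global = 12 there are two groups, which fits a buffer sized for 8 compute units. A launch with global = 60 would have ten groups and would index past the end of the same buffer, so sizing the count buffer by the number of work-groups is the safer choice.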

t-man · ‎10-13-2012

When you execute it you have to pass -parallel as an argument: ./hello_world -parallel -grid 3 4

yurtesen · ‎10-14-2012

As I mentioned earlier, you seem to have forgotten some files:

$ ./hello -parallel -grid 3 4
Generating 2D grid with 3 rows and 4 columns, for 12 vertices in all.
Time to generate grid graph is 42.428627 sec.
Graph has 12 vertices and 34 edges.
firstnbr = 0 2 5 8 10 13 17 21 24 26 29 32 34
nbr = 1 4 0 2 5 1 3 6 2 7 0 5 8 1 4 6 9 2 5 7 10 3 6 11 4 9 5 8 10 6 9 11 7 10
number of edges 12
Running parallel betweenness centrality...
1 devices
Device name = Intel(R) Xeon(R) CPU E5430 @ 2.66GHz; Number of compute units 8; Number of workitems per workgroup 1024
Failed to load kernel.¥ne$

t-man · ‎10-14-2012

hey yurtesen!

You should remove the following from the code, starting at line 609:

    fp = fopen("BCkernel.cl", "r");
    if (!fp) {
        fprintf(stderr, "Failed to load kernel.¥n");
        exit(1);
    }
    source_str = (char *)malloc(MAX_SOURCE_SIZE);
    source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
    fclose(fp);

It was just a test of mine to put the kernel in a different file, but for some reason it didn't work. I didn't notice that it would stop you from running everything. Thanks, and sorry!

P.S. You will notice that the graph generated is

0 1 2 3

4 5 6 7

8 9 10 11

and what the algorithm calculates is the sigma for each element. Sigma for 0 is 1, and every other node has sigma equal to the sum of its parents' sigmas. So 5, for example, gets 1 from node 1 and 1 from node 4, so sigma = 2. But the problem is that the second work-group takes node 4 from the "stack" twice, because the count variable hasn't been incremented, giving node 5 a sigma of 3, and so node 11 has sigma 14 instead of 10 (BC[11] will hold the value of sigma[11]). Hope it makes sense!
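The expected values quoted in this post can be reproduced with a sequential reference. The sketch below is plain C, not the kernel: the firstnbr and nbr arrays are copied from the program output shown earlier in the thread, while the function name `sigma_from_zero` and its `twice` parameter are hypothetical. Passing a vertex as `twice` expands that vertex two times, which imitates the reported bug.

```c
#include <assert.h>

#define NV 12

/* CSR adjacency of the 3 x 4 grid, as printed by the host program. */
static const int firstnbr[NV + 1] = {0, 2, 5, 8, 10, 13, 17, 21, 24, 26, 29, 32, 34};
static const int nbr[34] = {1, 4, 0, 2, 5, 1, 3, 6, 2, 7, 0, 5, 8, 1, 4, 6, 9,
                            2, 5, 7, 10, 3, 6, 11, 4, 9, 5, 8, 10, 6, 9, 11, 7, 10};

/* Breadth-first search from vertex 0 that counts shortest paths (sigma).
 * If twice is a valid vertex, that vertex is expanded two times;
 * pass -1 for a correct run. Returns sigma[target]. */
int sigma_from_zero(int target, int twice)
{
    int d[NV], sigma[NV], queue[NV];
    int head = 0, tail = 0;
    assert(target >= 0 && target < NV);
    for (int v = 0; v < NV; v++) { d[v] = -1; sigma[v] = 0; }
    d[0] = 0; sigma[0] = 1; queue[tail++] = 0;
    while (head < tail) {
        int root = queue[head++];
        int visits = (root == twice) ? 2 : 1;
        for (int pass = 0; pass < visits; pass++) {
            for (int e = firstnbr[root]; e < firstnbr[root + 1]; e++) {
                int node = nbr[e];
                if (d[node] == -1) {          /* first discovery: next level */
                    d[node] = d[root] + 1;
                    queue[tail++] = node;
                }
                if (d[node] == d[root] + 1)   /* root is a BFS parent of node */
                    sigma[node] += sigma[root];
            }
        }
    }
    return sigma[target];
}
```

A correct run gives sigma[5] = 2 and sigma[11] = 10. Expanding node 4 twice gives sigma[5] = 3 and sigma[11] = 14, which matches the wrong result described in the thread.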

yurtesen · ‎10-14-2012

Actually, it segfaults and nothing comes out. I ran it with valgrind, and I think you are copying data from unallocated memory, using uninitialized data, etc. In my opinion your problem is not due to the kernel itself (at least, you seem to have much more serious problems in the host code). You should try to clean up your code and make it more readable. Maybe you can then find the problem yourself. This is what I got when I tried to run your code:

$ ./hello -parallel -grid 3 4
Generating 2D grid with 3 rows and 4 columns, for 12 vertices in all.
Time to generate grid graph is 50.927097 sec.
Graph has 12 vertices and 34 edges.
firstnbr = 0 2 5 8 10 13 17 21 24 26 29 32 34
nbr = 1 4 0 2 5 1 3 6 2 7 0 5 8 1 4 6 9 2 5 7 10 3 6 11 4 9 5 8 10 6 9 11 7 10
number of edges 12
Running parallel betweenness centrality...
1 devices
Device name = Intel(R) Xeon(R) CPU E5430 @ 2.66GHz; Number of compute units 8; Number of workitems per workgroup 1024
Pot sa am 1024 workitems pe workgroup
Segmentation fault (core dumped)
$

I had good luck with the following C code for loading from file:

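    /* Needs <fcntl.h>, <stdio.h>, <stdlib.h>, <sys/mman.h>, <sys/stat.h> and <CL/cl.h>.
       context and err (a cl_int) are assumed to be declared earlier. */
    printf("\nTrying to use OpenCL source file %s\n", CLFILE);
    int fd = open(CLFILE, O_RDONLY);
    if (fd == -1) {
      fprintf(stderr, "Couldn't open the source file\n");
      exit(1);
    }
    struct stat filestat;
    if (fstat(fd, &filestat) == -1) {
      fprintf(stderr, "Couldn't stat the source file\n");
      exit(1);
    }
    size_t size = filestat.st_size;
    /* Map the file read-only; the explicit length passed below means the
       text does not have to be NUL-terminated. */
    const char *data = (const char*) mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) {
      fprintf(stderr, "Couldn't map the source file\n");
      exit(1);
    }
    /* size_t is printed with %zu, and %p expects a void pointer */
    printf("Mapped the source file (%zu bytes) to %p\n", size, (void*) data);
    cl_program program = clCreateProgramWithSource(context, 1, &data, &size, &err);
    if (err != CL_SUCCESS) {
      fprintf(stderr, "Couldn't create the OpenCL program\n");
      exit(1);
    }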

Good luck
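For systems without mmap, a portable alternative is to read the whole file with the standard C library. The following is a minimal sketch, not code from either program in this thread; `read_text_file` is a hypothetical helper. It sizes the buffer from the file itself rather than from a fixed MAX_SOURCE_SIZE, and it NUL-terminates the result.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads a whole file into a NUL-terminated heap buffer.
 * On success returns the buffer (the caller frees it) and stores its
 * length, excluding the terminator, in *len. Returns NULL on failure. */
char *read_text_file(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);
    char *buf = malloc((size_t)size + 1);
    if (buf == NULL) { fclose(fp); return NULL; }
    size_t got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) { free(buf); return NULL; }
    buf[got] = '\0';
    *len = got;
    return buf;
}
```

Because the buffer is NUL-terminated, it can be handed to clCreateProgramWithSource either with an explicit length or with a NULL lengths argument. Note that a kernel moved into its own .cl file must be plain OpenCL C: the quotes, \n sequences, and trailing backslashes of the string-literal version have to be removed.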

t-man · ‎10-14-2012

Thanks for all your help, much appreciated! I will try doing that!

t-man · ‎10-14-2012

Hmm, I tried it the way you did, but it is still not working. Any suggestions? Thanks!

yurtesen · ‎10-14-2012

Like I said, your program was generating segmentation faults anyway. Putting the kernel in a separate file probably won't help.

But in this specific case, you have a /* on line 103 but never a */ that matches it. (I believe you need a */ on line 141.)

After this, there are some other problems, but you should see them in the error log (it will print them the same as below).

I can't debug your program for you. You should put some effort into finding the problems, and making the code properly readable will make that easier for you. You could perhaps have spotted the missing */ if you didn't have so much stuff on the single line 103:

barrier(CLK_LOCAL_MEM_FENCE);if(glob!=0&&glob!=1&&glob!=2&&glob!=3)  BC[glob] = atomic_xchg(&sigma[glob],sigma[glob]);}; /*

Trying to use OpenCL source file BCkernel.cl
Mapped the source file (9087 bytes) to 0x7fd92f41a000
Error: Failed to build program executable!
CL_BUILD_PROGRAM_FAILURE
"/tmp/OCLzKCteO.cl", line 47: error: unrecognized token
                            pozition_local = atomic_add(pozition,0); nr = nr +1;\n
                                                                                ^
"/tmp/OCLzKCteO.cl", line 47: error: expected an expression
                            pozition_local = atomic_add(pozition,0); nr = nr +1;\n
                                                                                ^
"/tmp/OCLzKCteO.cl", line 85: error: function "GetSemaphor2" declared
          implicitly
                                      GetSemaphor2(&sem[0]);     temporal = atomic_xchg(&sigma[node],0); temporal2=atomic_xchg(&sigma[root],0);
                                      ^
"/tmp/OCLzKCteO.cl", line 86: error: function "ReleaseSemaphor2" declared
          implicitly
                                           atomic_xchg(&sigma[node],temporal+temporal2);atomic_xchg(&sigma[root],temporal2);ReleaseSemaphor2(&sem[0]);
                                                                                                                            ^
"/tmp/OCLzKCteO.cl", line 36: warning: variable "ft" was declared but never
          referenced
      float ft,aux1,delta_temp;
            ^
"/tmp/OCLzKCteO.cl", line 36: warning: variable "aux1" was declared but never
          referenced
      float ft,aux1,delta_temp;
               ^
"/tmp/OCLzKCteO.cl", line 36: warning: variable "delta_temp" was declared but
          never referenced
      float ft,aux1,delta_temp;
                    ^
4 errors detected in the compilation of "/tmp/OCLzKCteO.cl".
Internal error: clc compiler invocation failed.
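Since the missing */ was hard to spot by eye, a quick check of the kernel source before handing it to the compiler can help. This is a deliberately simple sketch in plain C with a hypothetical helper name; as a stated simplification it does not treat string literals or // line comments specially.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the 1-based line on which an unterminated block comment opens,
 * or 0 if every block comment in src is closed. */
int unterminated_comment_line(const char *src)
{
    assert(src != NULL);
    int line = 1;
    int open_line = 0;           /* 0 means we are outside a comment */
    for (size_t i = 0; src[i] != '\0'; i++) {
        if (src[i] == '\n') {
            line++;
        } else if (open_line == 0 && src[i] == '/' && src[i + 1] == '*') {
            open_line = line;
            i++;                 /* consume the star so it cannot also close */
        } else if (open_line != 0 && src[i] == '*' && src[i + 1] == '/') {
            open_line = 0;
            i++;                 /* consume the closing slash */
        }
    }
    return open_line;
}
```

Run on the contents of a kernel file, a non-zero result points at the line where the dangling comment starts, which in the case discussed above would have been line 103.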

t-man · ‎10-14-2012

Thank you for all your help so far. I will come back when everything is clean and running.

I did not get any info on my errors, so it was really hard for me to debug. Now I can clean it up, and I will post later when everything is working!

All the best to you !

yurtesen · ‎10-15-2012

Perhaps the compiler just does not understand that a */ closing the comment is missing and bails out altogether. It might actually be a bug. I will check it later and try to report it to AMD if I can reproduce the issue and if I don't forget.

t-man · ‎10-15-2012

OK, now it should work!

So the problem is as follows: the count_priv variable is 0, one work-group does an iteration of that while, after which count_priv is still 0, even though there is a line in the code stating "count_priv = count_priv + 1". This happens only when root = stack[pozition_local + count_priv*j + k]; is 4 (you have to run it with -parallel -grid 3 4 to see what I am talking about).

So basically everything is very nondeterministic. I have no idea why. Any suggestions?

P.S. the graph is

0 1 2 3

4 5 6 7

8 9 10 11

Sigma of node 0 is 1, and sigma of all other nodes starts at 0. Each node along the way takes its own sigma value plus the sigma of its parent (BFS). So nodes 1 and 4 will get 1, node 5 will get 2, node 2 will get 1, etc.

The problem is that node 4 will be iterated twice because count_priv is not incremented making sigma of node 5 = 3. The last line :

if(glob==11) BC[glob] = atomic_xchg(&sigma[glob],sigma[glob]);

puts the sigma of node 11 at position 11, which is normally 10, but when node 4 is taken twice it will be 14 (incorrect). Hope this makes sense!

Thanks again for everything!

yurtesen · ‎10-15-2012

Look, I added 3 lines to your code, two around the while(count_priv < nr_roots) line:

printf("befor while groupd id %d local id %d global id %d count %d nr_roots %d\n",k,i,glob,count_priv,nr_roots);
while(count_priv < nr_roots ){
printf("after while groupd id %d local id %d global id %d count %d nr_roots %d\n",k,i,glob,count_priv,nr_roots);

and one after the count increment:

count_priv = count_priv + 1;
printf("after count  groupd id %d local id %d global id %d count %d nr_roots %d\n",k,i,glob,count_priv,nr_roots);

Your first problem is the nr_roots variable, which you have as a local variable: it gets increased by other threads and becomes 2.

The other problem is that when your while count_priv<nr_roots loop exits, count_priv is also reset to 0, because the found_local!=0 loop is still running.

To put it simply, you seem to have bugs...

Here is a sample run:

$ ./hello -parallel -grid 3 4 |grep 'groupd id 0 local id 3 global id 3'

Generating 2D grid with 3 rows and 4 columns, for 12 vertices in all.

Time to generate grid graph is 43.088371 sec.

Graph has 12 vertices and 34 edges.

firstnbr = 0 2 5 8 10 13 17 21 24 26 29 32 34

nbr = 1 4 0 2 5 1 3 6 2 7 0 5 8 1 4 6 9 2 5 7 10 3 6 11 4 9 5 8 10 6 9 11 7 10

Running parallel betweenness centrality...

befor while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after count groupd id 0 local id 3 global id 3 count 1 nr_roots 1

befor while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after count groupd id 0 local id 3 global id 3 count 1 nr_roots 1

befor while groupd id 0 local id 3 global id 3 count 0 nr_roots 2

after while groupd id 0 local id 3 global id 3 count 0 nr_roots 2

after count groupd id 0 local id 3 global id 3 count 1 nr_roots 2

after while groupd id 0 local id 3 global id 3 count 1 nr_roots 2

after count groupd id 0 local id 3 global id 3 count 2 nr_roots 2