
t-man
Adept II

While loop executed twice by a work-group

Well the problem I have is the following:

I have a kernel that tries to calculate the betweenness centrality of a graph in parallel. What happens is very strange: a loop is executed twice by one of the work-groups. On average, once every 7-8 runs the second while loop ( while(count<nr_roots) ) gets executed twice by a work-group, although count is incremented during the first iteration. In my case I have a 12-vertex graph, nr_roots is 1 and count is 0; the while body gets executed, count gets incremented by 1, and yet count is still seen as 0 one more time.

This happens only once every 6-7 runs, remember, not always. Does anyone have any idea why? I also tried making count a __local variable (shared by the work-group) and a __private one (visible to a single work-item only), with no success. Any tips or suggestions are more than welcome!

" while ( found_local != 0){ \n" \

        "                                                                       \n" \

        "                                                                                     \n" \

         "                                                                                      \n " \

         "   if(i==0) {    level_local = atomic_add(nr_level,0); atomic_xchg(found,0);\n" \

         "                 pozition_local = atomic_add(pozition,0);\n " \

        "                  nr_roots = atomic_add(&level[level_local],0)/j; atomic_xchg(&count,0); nr=0; rest = atomic_add(&level[level_local],0)%j;  \n" \

         "                 if(k<rest) nr_roots = nr_roots + 1;}                                                 \n" \

          "                                                                 \n" \

           "                                                                \n" \

            "            barrier(CLK_GLOBAL_MEM_FENCE); \n   " \

                        "                              \n" \

             "           while(count < nr_roots ){   \n" \

              "           \n" \

                "              if(i==0){    \n" \

               "                 root = stack[pozition_local + count*j + k];\n" \

                "             succ_index[root] = 0; \n" \

                 "              nr_neigh = firstnbr[root+1] - firstnbr[root]; } \n" \

                  "           barrier(CLK_LOCAL_MEM_FENCE);\n" \

                   "         \n" \

                    "            neigh_per_thread = nr_neigh/size; \n" \

                     "       if(i<nr_neigh%size) \n" \

                      "          neigh_per_thread ++; \n" \

                       "     h = 0;  \n" \

                        "    while(h<neigh_per_thread)\n" \

                        "        {\n" \

                         "       node = nbr[firstnbr[root] + size*h + i];\n" \

                          "       \n" \

                           "     dw = atomic_cmpxchg(&d[node], -1, level_local + 1);\n" \

                            "    \n" \

                             "   if(dw == -1)\n" \

                              "          {\n" \

                                "         atomic_inc(&level[level_local + 1]);\n" \

                                  "       atomic_cmpxchg(found,0,1);\n" \

                                   "      dw = level_local + 1;\n" \

                                    "     gh = atomic_inc(nr_stack);\n" \

                                     "    stack[gh] = node;\n" \

                                     " \n" \

                                      "  }\n" \

                                "if(dw == level_local + 1)\n" \

                                 " {                                              \n" \

                                  "                                             \n" \

                                   "       temp = atomic_inc(&succ_index[root]);\n" \

                                   "      succ[firstnbr[root] + temp] = node;\n" \

                                   " GetSemaphor2(&sem[0]);     temporal = atomic_xchg(&sigma[node],0); temporal2=atomic_xchg(&sigma[root],sigma[root]);                                          \n" \

                                    "     atomic_xchg(&sigma[node],temporal+temporal2);ReleaseSemaphor2(&sem[0]);     \n" \

                                     "   }                              \n" \

                                "h++;                                   \n" \

                                "}                                      \n" \

                            "                                          \n" \

                       "if(glob%6==1) {atomic_add(&count,1);if(root==4&&nr1==1) BC[8] = 1;} \n" \

    "                   barrier(CLK_GLOBAL_MEM_FENCE); }  \n" \

                       " \n" \

               " barrier(CLK_LOCAL_MEM_FENCE);\n"

                "if(glob==0) {f= atomic_add(&level[level_local],0); atomic_add(pozition,f); atomic_add(nr_level,1); \n" \

                " }                                                     \n" \

                "                                                       \n" \

                " if(i==0) \n" \

                "       { atomic_add(global_sync,1); \n" \

                "        if ( k==0) { while(atomic_add(global_sync,0)< j); atomic_xchg(global_sync, 0); } \n" \

                "        else { while(atomic_add(global_sync,0) > 0); }} \n" \

                "barrier(CLK_LOCAL_MEM_FENCE);if(i==0) found_local = atomic_add(found,0);barrier(CLK_LOCAL_MEM_FENCE);\n" \

       "}  if(glob==11) BC[glob] = atomic_xchg(&sigma[11],sigma[11]); } \n";

0 Likes
1 Solution
t-man
Adept II

I managed to localize the problem. One work-group was going through the iteration much faster than the other, incrementing the global variable "nr_level" and making the other work-groups see a wrong value. Thank you yurtesen for all your help!


0 Likes
40 Replies
yurtesen
Miniboss

You are increasing count inside the while loop, but I think that, until that point, many threads can get past the while(count < nr_roots) line (that is, they can enter the loop before one thread has had a chance to increase count). When that happens, two or more threads can increase count even if it goes over nr_roots. Does that make sense?

0 Likes

Well, count is increased only by a single thread, "if(glob%6==1)", where glob is the global ID of the work-item, after which I place a barrier to be sure that all the work-items see that count has been increased before they start the new iteration.

The point is for each thread to go through each iteration once. As you can see, I have an "if(i==0) root = stack[].." which means that only the lowest-numbered thread of the group should do that bit, and for some reason that gets done twice with count = 0. I tested it.

Is that logic correct, or what exactly do you mean? Thanks for your help!

0 Likes

Let's say threads with global IDs ("glob") 1 and 7 entered the while(count < nr_roots) loop when count was zero. What would stop them both from increasing the counter while they are in there? Am I misunderstanding something?

By the way, having kernels in a separate file makes reading code much easier.

0 Likes

The thing to mention here is that "k" represents the group number, so count is individual per work-group. The two global threads 1 and 7 will actually modify different positions in count, so each position is modified by a single thread of one work-group.

I am sorry about the kernel, but when I tried to place it in a separate file the code stopped working for some reason.

I really appreciate any help you can give me at this point. Thank you!

0 Likes

How can you tell that global threads 1 and 7 will have different work-group IDs (k)? If your work-group size is 64, wouldn't threads 0 through 63 all be in the same work-group?

0 Likes

I explained it poorly, I am sorry. This code is for the test I am describing, where the work-group size is 6. Of course, 1 and 7 only work for this case, but the general question stands:

With a work-group size of 6, why would this behavior happen?

0 Likes

First of all, if workgroup size was 6, then first workgroup would not increase count at all since they will have 0,1,2,3,4,5 global IDs. (glob)

if(glob%6==1) {atomic_add(&count,1)

If your 'k' workgroup id is correct and nr_roots is indeed 1 then from what I can see, what you are seeing shouldnt happen.

But, I see that you are setting some values to nr_roots, so perhaps your problem is that it is actually not 1 for cases where you expect it should be 1

        "                  nr_roots = atomic_add(&level[level_local],0)/j; atomic_xchg(&count,0); nr=0; rest = atomic_add(&level[level_local],0)%j;  \n" \

         "                 if(k<rest) nr_roots = nr_roots + 1;}                                                 \n" \

You are assuming nr_roots is 1, but I would use printf to check it (perhaps only printing when it is not 1). Can you do this?

0 Likes

I did this and it is 1. The problem is that count is 0, the group does the iteration, and then count is 0 again, for the case when nr_roots is 1. I am trying to find out why one group would go through the iteration twice with the same count.

And the way nr_roots is calculated is fine, I checked it. The main problem is count.

k = get_group_id(0), so there is no problem here; I checked that as well. For work-group 0 (threads 0 1 2 3 4 5), glob%6==1 selects glob 1.
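For readers following the ID arithmetic, here is a small host-side C sketch (not part of the original program) of how global, group and local IDs relate in a 1-D range with no offset. The function names are made up for illustration, and it assumes a local size greater than 1.

```c
#include <stddef.h>

/* Host-side illustration only: for a 1-D NDRange with no global offset,
 * group id = global id / local size and local id = global id % local size. */
size_t group_of(size_t glob, size_t local_size) { return glob / local_size; }
size_t local_of(size_t glob, size_t local_size) { return glob % local_size; }

/* Global id of the single work-item in group k that passes the test
 * (glob % local_size == 1), i.e. the work-item whose local id is 1.
 * Assumes local_size > 1. */
size_t incrementer_for_group(size_t k, size_t local_size)
{
    return k * local_size + 1;
}
```

With a local size of 6 and two groups, this gives global IDs 1 and 7, one per group. Testing the local ID against 1 inside the kernel would express the same selection without hard-coding the 6.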

I really appreciate all your help, thanks , maybe you have some other suggestions!

0 Likes

Yes, you are right about group zero. I should go and rest, I guess.

There is something strange in your explanation: if count is zero, it can't just jump to 2, because it is always the same thread which increases count within a work-group (thread ID 1 in the first work-group, 7 in the second, etc.). I think it should know whether it increased it or not.

What about printing it right after atomic add? If what you say is true, you should see it zero right? Do you see it zero?

if(glob%6==1) {atomic_add(&count,1);

I can run and see what your program is doing if you can make a small test case. But I have no other ideas about what might be wrong at this point

0 Likes

thanks again for the swift reply. So What should I do regarding the test case? Should I give you the whole source? Could that prove helpful?

My email is tudor_uricec@yahoo.com ( skype is same tudor_uricec ). Thanks!

0 Likes

Normally a perfect test case is a program which is as simple as possible yet can still cause the problem, with some instructions on how you are compiling and running it, obviously. You can attach it to the thread so everybody who is interested can have a look.

0 Likes
t-man
Adept II

compile: gcc -o hello_world -I /opt/AMDAPP/include/ -L /opt/AMDAPP/lib/x86/ -I /home/tudor/Desktop/examples/OpenCL_Hello_World_Example/ hello.c -lOpenCL -lm , where the -L option gives the OpenCL library path, the first -I is the OpenCL include path and the second -I is the path to the defs.h header.

execute: ./hello_world -parallel -grid 3 4 , which will create a 3 x 4 graph and will execute on two work-groups of 6 threads each. Any help is more than appreciated! Thank you!

Also, the results will be saved in BC, and you will notice that BC[11] should have the correct result of 10, but sometimes it has 14, since the second work-group iterates through the count while loop twice when it shouldn't.

If you take out the "if(glob==11)" statement, BC will be filled with all the values.

0 Likes

It is trying to load a "Wiki-Vote.txt" and then prints failed to load kernel.

Also, if you use:

#include "defs.h"

You can simply compile it...

$ gcc -o hello hello.c -lOpenCL

hello.c: In function ‘betwenessparallel’:

hello.c:588:17: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘size_t’ [-Wformat]

hello.c:756:17: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t’ [-Wformat]

$

This works on Ubuntu...
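The two -Wformat warnings above come from printing a size_t with %d. A minimal sketch of the usual remedy, assuming a C99 or later compiler, is to use the %zu conversion; the helper name format_size is made up for this example.

```c
#include <stdio.h>
#include <stddef.h>

/* Formats a size_t using the matching %zu conversion (C99 and later).
 * Returns the number of characters written, as snprintf does. */
int format_size(char *buf, size_t buflen, size_t value)
{
    return snprintf(buf, buflen, "%zu", value);
}
```

The same conversion works directly in printf, so the calls on lines 588 and 756 of hello.c would most likely only need their format strings changed.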

0 Likes

On another note

            clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(int),(void *) &l, NULL);

            count = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(int)*l, NULL, NULL);

For a CPU the max compute units is the core count....but you are using:

k = get_group_id(0);

The group ID can be larger than the core count; the number of groups is global/local. For this test case it shouldn't be a problem, since you probably have a dual-core or larger processor and you seem to have:

            local = 6
            global = local*2;
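As a rough sketch of the sizing rule implied here: a buffer indexed by get_group_id(0) needs one slot per work-group, not one per compute unit. The helper names below are hypothetical, and plain int stands in for cl_int so the snippet compiles without the OpenCL headers.

```c
#include <stddef.h>

/* Number of work-groups in a 1-D NDRange. This assumes the global size
 * is a multiple of the local size, as it is in the test case above. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}

/* Bytes needed for one int slot per work-group, which is what a
 * per-group counter indexed by get_group_id(0) requires. */
size_t per_group_buffer_bytes(size_t global_size, size_t local_size)
{
    return num_work_groups(global_size, local_size) * sizeof(int);
}
```

For local = 6 and global = 12 this gives 2 groups, so the count buffer needs room for 2 ints regardless of how many compute units the device reports.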

I really have to suggest cleaning up the code and putting those kernels out from there. It is  not very easy to follow what is going on...

0 Likes

when you execute it you have to use -parallel as a argument ./hello_world -parallel -grid 3 4

0 Likes

As I mentioned earlier, you seem to have forgotten some files:

$ ./hello -parallel -grid 3 4

Generating 2D grid with 3 rows and 4 columns, for 12 vertices in all.

Time to generate grid graph is 42.428627 sec.

Graph has 12 vertices and 34 edges.

firstnbr = 0 2 5 8 10 13 17 21 24 26 29 32 34

nbr = 1 4 0 2 5 1 3 6 2 7 0 5 8 1 4 6 9 2 5 7 10 3 6 11 4 9 5 8 10 6 9 11 7 10

number of edges 12

Running parallel betweenness centrality...

1 devices

Device name = Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz; Number of compute units 8; Number of workitems per workgroup 1024

Failed to load kernel.¥ne$

0 Likes

hey  yurtesen!

You should eliminate from the code , starting with line 609

"

fp = fopen("BCkernel.cl", "r");

            if (!fp) {

                fprintf(stderr, "Failed to load kernel.¥n");

                exit(1);

          }

            source_str = (char *)malloc(MAX_SOURCE_SIZE);

            source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);

            fclose(fp);"

this. It was just a test of mine to put the kernel in a different file, but for some reason it didn't work. I didn't notice that it would keep you from running everything. Thanks, and sorry!

P.S. You will notice that the graph generated is

0 1 2 3

4 5 6 7

8 9 10 11

and what the algorithm calculates is the sigma for each element. Sigma for node 0 is 1, and every other node has sigma equal to the sum of the sigmas of its parents. Node 5, for example, gets 1 from node 1 and 1 from node 4, so sigma = 2. The problem is that the second work-group takes node 4 from the "stack" twice, because the count variable hasn't been incremented, giving node 5 a sigma of 3, and thus node 11 ends up with sigma 14 instead of 10 (BC[11] holds the value of sigma[11]). Hope it makes sense!

0 Likes

Actually, it segfaults and nothing comes out. I ran it under valgrind and I think you are copying data from unallocated memory, using uninitialized data, etc. In my opinion your problem is not due to the kernel itself (at least, you seem to have much more serious problems in the host code). You should try to clean up your code and make it more readable; maybe you will then find the problem yourself. This is what I got when I tried to run your code:

$ ./hello -parallel -grid 3 4

Generating 2D grid with 3 rows and 4 columns, for 12 vertices in all.

Time to generate grid graph is 50.927097 sec.

Graph has 12 vertices and 34 edges.

firstnbr = 0 2 5 8 10 13 17 21 24 26 29 32 34

nbr = 1 4 0 2 5 1 3 6 2 7 0 5 8 1 4 6 9 2 5 7 10 3 6 11 4 9 5 8 10 6 9 11 7 10

number of edges 12

Running parallel betweenness centrality...

1 devices

Device name = Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz; Number of compute units 8; Number of workitems per workgroup 1024

Pot sa am 1024 workitems pe workgroup   [Romanian: "I can have 1024 workitems per workgroup"]

Segmentation fault (core dumped)

$

I had good luck with the following C code for loading from file:

    printf("\nTrying to use OpenCL source file %s\n", CLFILE);

    int fd = open(CLFILE, O_RDONLY);

    if (fd == -1) {

      fprintf(stderr, "Couldn't open the source file\n");

      exit(1);

    }

    struct stat filestat;

    if (fstat(fd, &filestat) == -1) {

      fprintf(stderr, "Couldn't stat the source file\n");

      exit(1);

    }

    size_t size = filestat.st_size;

    const char *data = (char*) mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);

    printf("Mapped the source file (%zu bytes) to %p\n", size, (const void *)data);

    cl_program program = clCreateProgramWithSource(context, 1, &data, &size, &err);

    if (err != CL_SUCCESS) {

      fprintf(stderr, "Couldn't create the OpenCL program\n");

      exit(1);

    }

Good luck
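For systems without mmap, a portable alternative is to read the whole file with standard C I/O. This is only a sketch: read_text_file is a made-up helper, not something from the original program, and it assumes the kernel source fits in memory.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads an entire file into a newly allocated, NUL-terminated buffer.
 * On success returns the buffer and stores its length in *out_len;
 * on failure returns NULL. The caller frees the buffer. */
char *read_text_file(const char *path, size_t *out_len)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;

    /* Find the file size by seeking to the end and back. */
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long end = ftell(fp);
    if (end < 0 || fseek(fp, 0, SEEK_SET) != 0) { fclose(fp); return NULL; }

    char *buf = malloc((size_t)end + 1);
    if (!buf) { fclose(fp); return NULL; }

    size_t got = fread(buf, 1, (size_t)end, fp);
    fclose(fp);
    if (got != (size_t)end) { free(buf); return NULL; }

    buf[got] = '\0';
    if (out_len) *out_len = got;
    return buf;
}
```

The returned pointer and length can then be passed to clCreateProgramWithSource in the same way as data and size in the mmap version above.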

0 Likes

Thanks for all your help, much appreciated! I will try doing that!

0 Likes

Hmm, I tried it the way you did, but it is still not working. Any suggestions? Thanks!

0 Likes

Like I said, your program was generating segmentation faults anyway. Putting the kernel in a separate file probably won't help.

But in this specific case, you have a /* on line 103 but never a matching */ (I believe you need the */ on line 141).

After this there are some other problems, but you should see them in the error log (it will print them the same way as below).

I can't debug your program for you. You should put some effort into finding the problems, and making the code properly readable will make that easier for you. You might have spotted the missing */ if you didn't have so much stuff on the single line 103:

barrier(CLK_LOCAL_MEM_FENCE);if(glob!=0&&glob!=1&&glob!=2&&glob!=3)  BC[glob] = atomic_xchg(&sigma[glob],sigma[glob]);}; /*

Trying to use OpenCL source file BCkernel.cl

Mapped the source file (9087 bytes) to 0x7fd92f41a000

Error: Failed to build program executable!

CL_BUILD_PROGRAM_FAILURE

"/tmp/OCLzKCteO.cl", line 47: error: unrecognized token

                            pozition_local = atomic_add(pozition,0); nr = nr +1;\n

                                                                                ^

"/tmp/OCLzKCteO.cl", line 47: error: expected an expression

                            pozition_local = atomic_add(pozition,0); nr = nr +1;\n

                                                                                ^

"/tmp/OCLzKCteO.cl", line 85: error: function "GetSemaphor2" declared

          implicitly

                                      GetSemaphor2(&sem[0]);     temporal = atomic_xchg(&sigma[node],0); temporal2=atomic_xchg(&sigma[root],0);

                                      ^

"/tmp/OCLzKCteO.cl", line 86: error: function "ReleaseSemaphor2" declared

          implicitly

                                           atomic_xchg(&sigma[node],temporal+temporal2);atomic_xchg(&sigma[root],temporal2);ReleaseSemaphor2(&sem[0]);

                                                                                                                            ^

"/tmp/OCLzKCteO.cl", line 36: warning: variable "ft" was declared but never

          referenced

      float ft,aux1,delta_temp;

            ^

"/tmp/OCLzKCteO.cl", line 36: warning: variable "aux1" was declared but never

          referenced

      float ft,aux1,delta_temp;

               ^

"/tmp/OCLzKCteO.cl", line 36: warning: variable "delta_temp" was declared but

          never referenced

      float ft,aux1,delta_temp;

                    ^

4 errors detected in the compilation of "/tmp/OCLzKCteO.cl".

Internal error: clc compiler invocation failed.

0 Likes

Thank you for all your help so far, I will come back when everything is clean and running.

I did not get any info on my errors, so it was really hard for me to debug. Now I can clean it up and will post later when everything is working!

All the best to you !

0 Likes

Perhaps the compiler just does not understand that a */ is missing to close the comment, and bails out altogether. It might actually be a bug; I will check it later and try to report it to AMD if I can reproduce the issue and if I don't forget.

0 Likes

Ok now it should work !

So the problem is as follows: the count_priv variable is 0, one work-group does an iteration of that while loop, after which count_priv is still 0, even though there is a line in the code stating "count_priv = count_priv + 1". This happens only when root = stack[pozition_local + count_priv*j + k]; is 4 (you have to run it with -parallel -grid 3 4 to see what I am talking about).

So basically everything is very nondeterministic. I have no idea why. Any suggestions?

P.S. the graph is

0 1 2 3

4 5 6 7

8 9 10 11

Sigma of node 0 is 1, and sigma of all other nodes starts at 0. Along the way, each node adds the sigma of its parent to its own (BFS). So nodes 1 and 4 will get 1, node 5 will get 2, node 2 will get 1, etc.

The problem is that node 4 will be iterated twice because count_priv is not incremented, making sigma of node 5 = 3. The last line:

if(glob==11) BC[glob] = atomic_xchg(&sigma[glob],sigma[glob]);

stores the sigma of node 11 at position 11, which normally is 10, but when node 4 is taken twice it will be 14 (incorrect). Hope this makes sense!

Thanks again for everything!

0 Likes

Look, I added 3 printf lines to your code, two around the start of the count while loop:

printf("befor while groupd id %d local id %d global id %d count %d nr_roots %d\n",k,i,glob,count_priv,nr_roots);

while(count_priv < nr_roots ){

printf("after while groupd id %d local id %d global id %d count %d nr_roots %d\n",k,i,glob,count_priv,nr_roots);

and one after the count increment:

count_priv = count_priv + 1;

printf("after count  groupd id %d local id %d global id %d count %d nr_roots %d\n",k,i,glob,count_priv,nr_roots);

Your first problem is the nr_roots variable, which you have as a local variable: it gets increased by other threads and becomes 2.

The other problem is that when your while(count_priv<nr_roots) loop exits, count_priv is also reset to 0, because the thread continues to run the found_local!=0 loop.

To put it simply, you seem to have bugs...

Here is a sample run:

$ ./hello -parallel -grid 3 4  |grep 'groupd id 0 local id 3 global id 3'

Generating 2D grid with 3 rows and 4 columns, for 12 vertices in all.

Time to generate grid graph is 43.088371 sec.

Graph has 12 vertices and 34 edges.

firstnbr = 0 2 5 8 10 13 17 21 24 26 29 32 34

nbr = 1 4 0 2 5 1 3 6 2 7 0 5 8 1 4 6 9 2 5 7 10 3 6 11 4 9 5 8 10 6 9 11 7 10

Running parallel betweenness centrality...

befor while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after count  groupd id 0 local id 3 global id 3 count 1 nr_roots 1

befor while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after count  groupd id 0 local id 3 global id 3 count 1 nr_roots 1

befor while groupd id 0 local id 3 global id 3 count 0 nr_roots 2

after while groupd id 0 local id 3 global id 3 count 0 nr_roots 2

after count  groupd id 0 local id 3 global id 3 count 1 nr_roots 2

after while groupd id 0 local id 3 global id 3 count 1 nr_roots 2

after count  groupd id 0 local id 3 global id 3 count 2 nr_roots 2

befor while groupd id 0 local id 3 global id 3 count 0 nr_roots 2

after while groupd id 0 local id 3 global id 3 count 0 nr_roots 2

after count  groupd id 0 local id 3 global id 3 count 1 nr_roots 2

after while groupd id 0 local id 3 global id 3 count 1 nr_roots 2

after count  groupd id 0 local id 3 global id 3 count 2 nr_roots 2

befor while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after count  groupd id 0 local id 3 global id 3 count 1 nr_roots 1

befor while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after while groupd id 0 local id 3 global id 3 count 0 nr_roots 1

after count  groupd id 0 local id 3 global id 3 count 1 nr_roots 1

Time for betweenness centrality is 43.088371 sec.

TEPS score is 9.469e+00

$

Thank you for your swift reply!

>Your first problem is the nr_roots variable which you have it as a local variable, it gets increased by other threads and become 2

nr_roots gets increased in only one line, and it is increased by thread 0 of each group, so no other thread will be increasing it. Is that logic OK?

>The other problem is when your while count_priv<nr_roots loop exits, it also resets the count_priv=0 because it still continues to run found_local!=0 loop.

     After the count while loop is finished I synchronize all the threads, the found while loop starts over, and count_priv has to be set to 0 again since nr_roots will get a different value. Does this make sense as well?

count_priv = 0;

                //count_priv is 0 initially

            while(count_priv < nr_roots ) <--- so before going into the while each thread will have count_priv 0

So it happens sometimes (not always) that the threads of the second group enter the while with count_priv 0, do their processing and set it to 1, and when they reach the while again, count_priv is 0 again and the "node" taken from the "stack" is the same as in the previous iteration. Why could this be?

This happens when sigma[11] is 14 instead of 10, because node 4 is taken from the stack twice instead of once, making the sigma of node 5 equal to 3 instead of 2.

I really appreciate all your help! It's more than useful. I am writing this application for my master's thesis, so I really need to get it done!

0 Likes


t-man wrote:


nr_roots gets increased in only one line, and it is increased by thread 0 of each group, so no other thread will be increasing it. Is that logic OK?


No, your thread goes all the way back to


while ( found_local != 0){

then starts from there again and sometimes re-increases it.

Also, you are checking the local ID, and a thread with the same local ID exists twice since you have two groups.

if(i==0) {level_local = atomic_add(nr_level,0); atomic_xchg(found,0);
          pozition_local = atomic_add(pozition,0);
          nr_roots = atomic_add(&level[level_local],0)/j; count = 0; rest = atomic_add(&level[level_local],0)%j;
          if(k<rest) nr_roots = nr_roots + 1;
     }

If you put some printf statements to above area, you can see how many times you are increasing nr_roots.... it definitely becomes 2 at some point.

Or maybe I dont understand what you are trying to explain maybe somebody else can help further somehow...I dont think I can correct your code (basically because I dont seem to understand what it should be doing exactly)

0 Likes

So, when the threads go back to the while(found_local!=0) loop they calculate a new nr_roots (nr_roots = atomic_add(&level[level_local],0)/j; where j is the number of work-groups), which basically says how many roots each work-group will process this iteration. The increment means, for example, that if I have two work-groups and 7 elements to be processed, 7/2 = 3, and work-group 0 gets one extra in nr_roots to make it 4 (3 + 4 = 7).

The thread with the same local ID exists twice (for 2 work-groups), but that's the point: I have one nr_roots for each group, and at each iteration the group processes nr_roots nodes.

Are there any suggestions you might have as to whom I can contact? Maybe I can talk with someone on Skype or MSN or something to explain exactly what the problem is. It will be much easier. It shouldn't take more than 5 minutes.

Thanks! Have a nice day!

0 Likes

I guess you just have to wait if somebody is interested to help your problem. Your code is little confusing for my taste

For example I dont understand why you have nr_roots = atomic_add(&level[level_local],0) / j   (why atomic add 0? makes no sense?)

Then sometimes when k < rest you increase nr_roots -> if(k<rest) nr_roots = nr_roots + 1; (and it is increasing it sometimes in your test case also)

Of course if you put me as co-author in your thesis, I can write your program from scratch hehe ok joking!

0 Likes

There is a lot of explanation to do, everything behind it has a purpose.

For example, I had problems where a variable read from global memory did not hold the latest value, so by doing atomic_add with 0 I know the function will return the latest value.

The second one is very simple. If, for example, I have 2 work-groups and level[level_local] nodes (for this example let's say 9 nodes) to be processed, then each work-group should get half of the nodes. But if there is an odd number of nodes, then with level[level_local] / 2 they will both get 4 nodes. So what I do is keep the value level[level_local]%j in rest, and the first "rest" work-groups get 1 extra node to process. In our example, work-group 0 increases nr_roots from 4 to 5,

so 4 + 5 = 9.

Different example: 35 nodes and 4 work-groups.

1st step: each work-group gets 8 nodes and 3 more are remaining, so work-groups 0, 1 and 2 get 8 + 1 nodes. Makes sense?
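The split described here can be written as a small stand-alone C function. This is a sketch that mirrors the nr_roots and rest arithmetic from the kernel, with group IDs counted from 0 as get_group_id does; the function names are invented for the example.

```c
/* Number of roots that work-group k handles when `total` nodes on the
 * current level are split across `groups` work-groups: every group gets
 * total / groups, and the first (total % groups) groups take one extra. */
int roots_for_group(int total, int groups, int k)
{
    int nr_roots = total / groups;
    int rest = total % groups;
    if (k < rest)
        nr_roots = nr_roots + 1;
    return nr_roots;
}

/* Sum of the shares over all groups; it should always equal `total`. */
int roots_sum(int total, int groups)
{
    int sum = 0;
    for (int k = 0; k < groups; k++)
        sum += roots_for_group(total, groups, k);
    return sum;
}
```

For 9 nodes and 2 groups, group 0 gets 5 and group 1 gets 4. For 35 nodes and 4 groups, the groups with k = 0, 1, 2 get 9 each and the group with k = 3 gets 8.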

The problem is actually quite interesting once you get into it. I can make it worth the while of anyone willing to help me. Thank you for your time so far!

0 Likes

atomic_add does not guarantee that you will have the latest value. For example, if some thread is about to write there after you get it, you will simply get the previous value. Did you read in its manual page that it helps you get the latest values? No...

http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/atomic_add.html

You will still simply get the value stored in the location at the time atomic_add runs; maybe it coincidentally helps. But if there is a problem in your code which causes you to get wrong values, you will simply continue getting wrong values, just maybe less often or at different times.
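To illustrate the point, here is a single-threaded C11 sketch that replays one possible interleaving. It uses atomic_fetch_add from <stdatomic.h> as a stand-in for OpenCL's atomic_add, which is an assumption made so the example runs on the host; the function names are hypothetical.

```c
#include <stdatomic.h>

/* Reads a shared counter with a read-modify-write that adds 0, the C11
 * analogue of the OpenCL idiom atomic_add(p, 0). The result is a snapshot
 * of the value at the moment of the call, nothing more. */
int snapshot(atomic_int *p)
{
    return atomic_fetch_add(p, 0);
}

/* Replays one interleaving in a single thread: the "reader" takes a
 * snapshot, then the "writer" bumps the counter. Returns 1 if the
 * reader's copy no longer matches the counter afterwards. */
int snapshot_goes_stale(void)
{
    atomic_int nr_level;
    atomic_init(&nr_level, 1);

    int seen = snapshot(&nr_level);   /* reader sees 1 */
    atomic_fetch_add(&nr_level, 1);   /* writer moves on to level 2 */

    return seen != atomic_load(&nr_level);
}
```

The read itself is atomic, but nothing ties the copied value to what other work-groups do afterwards, so any ordering between groups has to come from explicit synchronization.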

nr_roots can be 1 or 2 at any point, because same workgroup can go back in while loop and increase it again (at random times). So, in the beginning it is 1 but then it becomes 2 at some point. If you are dividing some work, I assume the resulting amount assigned with nr_roots to a workgroup should not be changing eh?

I am dividing work, but nr_roots gets a different value per iteration. In the first iteration nr_roots gets the value "n" and n nodes get processed by the threads in that work-group; at the next iteration it gets a new value "m", which is also processed by the threads in that work-group. So that is the point of nr_roots: it changes all the time. And you can see that at each iteration nr_roots gets a new value (nr_roots = atomic_add(&level[level_local],0)/j) and this value can be increased only once. When a new iteration starts, nr_roots gets a new value from the level array.

thank you!

0 Likes

I still think nr_roots is increased in a sort of random manner in your code. But if you say it is supposed to be like that, who am I to argue? You have a while(count<nr_roots) loop, but nr_roots can change while a thread is executing the loop. Therefore, in my opinion, you will get random results. I hope somebody who understands better what you are trying to do will write a comment...

0 Likes

Hey yurtesen!

I noticed that you can use printf in your kernel. How did you do that as I cannot? Thanks!

0 Likes

t-man wrote:

Hey yurtesen!

I noticed that you can use printf in your kernel. How did you do that as I cannot? Thanks!

Yes, I told you to use printf's to check the problem several times

Are you using AMDs SDK? because strangely it simply worked with AMDs SDK without any other code change. (I tested on CPU only, but it should work on GPU also).

Normally you should add the following pragma to get printf to be recognized:

#pragma OPENCL EXTENSION cl_amd_printf : enable

I am not sure why it worked without it in your code. You can also use it with GPUs. Also you should be able to debug your code using CodeXL, even line by line I think

http://developer.amd.com/tools/hc/CodeXL/Pages/default.aspx

0 Likes

#pragma OPENCL EXTENSION cl_amd_printf : enable this solved the problem thank you!

Question about the  CodeXL: Does it work if I have a Intel dual core?

0 Likes

t-man wrote:

#pragma OPENCL EXTENSION cl_amd_printf : enable this solved the problem thank you!

Question about the  CodeXL: Does it work if I have a Intel dual core?

It might work I guess, you simply wont have access to CPU performance counters for profiling. Let us know if you can get it working. Intel has some products but they cost a lot and I am not sure which product can debug OpenCL, you can maybe get a 30day trial for some of the products. Next time you might consider getting an AMD CPU, Intel is very expensive

Also, wasn't I right that nr_roots gets incremented (perhaps due to nr_level being incremented, but nevertheless)?

0 Likes

So the problem was as follows:

1st level: work-group 0 increments nr_level

1st level: work-group 1 checks nodes on the wrong level, getting a wrong nr_roots as well. So you were right. Thanks for everything!
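The interleaving described here can be replayed sequentially on the host. The sketch below is not the kernel and uses invented per-level node counts; it only shows how a late read of the shared level counter yields a different nr_roots. The second function shows one possible remedy (a private per-group level counter), which is an assumption, since the thread does not say how the code was actually fixed.

```c
/* Per-group share of the nodes on one BFS level, using the same
 * division-plus-remainder split as the kernel. */
static int share(const int *level, int lvl, int groups, int k)
{
    int n = level[lvl] / groups;
    if (k < level[lvl] % groups)
        n++;
    return n;
}

/* Sequential replay of the bad interleaving. level[] holds node counts
 * per BFS level; nr_level plays the shared "current level" counter.
 * Group 0 finishes level 1 and bumps nr_level before group 1 reads it.
 * Returns the nr_roots that group 1 ends up computing. */
int group1_roots_racy(const int *level, int groups)
{
    int nr_level = 1;
    int lvl0 = nr_level;                 /* group 0 reads level 1      */
    (void)share(level, lvl0, groups, 0); /* group 0 does its work      */
    nr_level = nr_level + 1;             /* group 0 advances the level */
    int lvl1 = nr_level;                 /* group 1 reads too late: 2  */
    return share(level, lvl1, groups, 1);
}

/* Same situation, but group 1 tracks the level in a private counter
 * instead of re-reading the shared one. One possible remedy only. */
int group1_roots_private(const int *level, int groups)
{
    int my_level = 1;
    return share(level, my_level, groups, 1);
}
```

With made-up counts {1, 2, 6} and 2 groups, the racy version returns 3 for group 1 (its share of level 2), while the private-counter version returns 1 (its share of level 1).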

Going to try the debugger, have met some other interesting bugs now

0 Likes

I installed the AMD SDK for 64-bit Windows 7 and Visual Studio C++ 2012, and I have trouble setting up the environment. Do you have any idea how to set up the libraries and includes correctly? Also, will it be a problem that it is C++ and not C (my program is a .c file)? It should have a C compiler, right?

0 Likes