Archives Discussions

philips
Journeyman III

Porting from CUDA to OpenCL: Pointer problem

program terminates with unhandled exception

Hi.

I'm porting a program from CUDA to OpenCL. The program stores some data in a tree structure, and from a given node in the tree it needs to access some of the information stored in that node.

 


The CUDA version works as it should.

The OpenCL version (running on the same data) terminates with an unhandled exception.

If I manually set colorBlock to a random unsigned long, it works and does not terminate with the exception. It seems that somewhere in those first seven lines something goes wrong and colorBlock is not assigned a valid unsigned long.

 

New information:

 

The program now runs on a Nvidia GPU. I did not change the function at all.

 

On the CPU (with the Ati platform) there's still the same error.

I really hope someone knows what's wrong here, because I have no clue.

Thanks for reading.

 

//CUDA VERSION
__device__ void lookupInfo(int* node)
{
    int* pageHeader = (int*)( (int) node & -PageBytes );           // PageBytes is a constant int value
    int* blockInfo  = pageHeader + *pageHeader;
    int* blockStart = blockInfo + blockInfo[ BlockInfo_BlockPtr ]; // BlockInfo_BlockPtr is a constant int value

    // the following three lines are similar to the ones above, so I'll leave the details out
    int* attachInfos = ...
    int* attachInfo  = ...
    int* attachData  = ...

    unsigned long long* dxtBlock = (unsigned long long*)( attachData + ( (node - blockStart) >> 2 ) * 6 );
    U64 colorBlock = dxtBlock[0];

    // after that the function uses the variable colorBlock
    ...
}

//OPENCL VERSION
void lookupInfo(__global int* node)
{
    __global int* pageHeader = (__global int*)( (int) node & -PageBytes );  // PageBytes is a constant int value
    __global int* blockInfo  = pageHeader + *pageHeader;
    __global int* blockStart = blockInfo + blockInfo[ BlockInfo_BlockPtr ]; // BlockInfo_BlockPtr is a constant int value

    // the following three lines are similar to the ones above, so I'll leave the details out
    __global int* attachInfos = ...
    __global int* attachInfo  = ...
    __global int* attachData  = ...

    __global unsigned long* dxtBlock = (__global unsigned long*)( attachData + ( (node - blockStart) >> 2 ) * 6 );
    unsigned long colorBlock = dxtBlock[0];

    // after that the function uses the variable colorBlock
    ...
}

0 Likes
15 Replies
Illusio
Journeyman III

I suppose it's possible that all your code gets optimized away if you assign colorBlock a random value, so it might not work in that case either.

Anyway, casting "node" to "int" isn't a safe operation. It's possible to have 64 bit pointers and 32 bit integers in OpenCL. The cast should probably be to intptr_t.

It's also a bit hard to evaluate if the alignment operation with PageBytes is safe. It could bomb if there's an issue with the external allocation.

 

0 Likes

I think you might be right that the problem is with that line:

__global int* pageHeader = (__global int*)( (int) node & -PageBytes );

I just took it directly from CUDA. 

Now I've changed it to intptr_t, but that did not help. 

 

Here is the disassembly. Maybe you guys can make sense of it

 http://pastebin.com/trEpWqWK

I have highlighted the line where Visual Studio pointed to the error.

(line 371)

 

I am not exactly sure where in that code that is...

0 Likes

Here is the full function just as I am using it.

 

PageBytes == 1 << 13;

 

 

void lookupVoxelColorNormal(float4* colorRes, float4* normalRes, const CastResult* castRes, const CastStack* stack)
{
    // Find DXTNode.
    __global int* pageHeader  = (__global int*)((intptr_t)(*castRes).node & -PageBytes);
    __global int* blockInfo   = pageHeader + *pageHeader;
    __global int* blockStart  = blockInfo + blockInfo[2];
    __global int* attachInfos = blockInfo + 4;
    __global int* attachInfo  = attachInfos + 2 * 1;
    __global int* attachData  = blockInfo + attachInfo[1];

    int foo = (int)((*castRes).node - blockStart);
    foo = foo >> 2;
    foo *= 6;
    __global int* bar = attachData + foo;
    __global unsigned long* dxtBlock = (__global unsigned long*) bar;

    // Fetch.
    U64 colorBlock   = dxtBlock[0];
    U64 normalBlockA = dxtBlock[1];
    U64 normalBlockB = dxtBlock[2];

    // Decode.
    int texelIdx = (*castRes).childIdx | (int)((((*castRes).node - pageHeader) & 2) << 2);
    float4 tmp = decodeDXTColor(colorBlock, texelIdx);
    *colorRes = (float4)(tmp.x, tmp.y, tmp.z, 255.0f);
    //*normalRes = decodeDXTNormal(normalBlockA, normalBlockB, texelIdx);
}

0 Likes


You probably need the same cast in the line

    int foo = (int)((*castRes).node - blockStart);

This should likely be (int)((intptr_t)(*castRes).node - blockStart);

0 Likes

Ok, my assembly might be a little rusty, but here goes.

...

369: And rax with -PageBytes (so your aligned pageHeader variable is in rax, in other words)

370: Load rcx with the contents of the memory location pointed to by rax (that is, the "*pageHeader" part of the blockInfo assignment)

371 (where it crashes): Load r8 with the contents of [rax+rcx*4+1Ch], which looks like it implements the assignment to attachData.

In other words, your code has actually read out the contents of *pageHeader without crashing, so the alignment operation is likely fine. However, blockInfo, which is calculated from the data in *pageHeader, appears to be invalid for some reason.

I'd guess that you have garbage data in *pageHeader. Maybe a coherence issue? Anyway, if you can do a register dump at the point where it crashes, you can check whether rcx looks like a reasonable value for an offset to compute blockInfo with. If it looks fine, then I'm out of ideas.

 

0 Likes

Thank you, Illusio. That is really good to know.

 

I suppose you have seen my comment saying that it now works on the Nvidia GPU.

 

Do you have any idea if there could be some differences between CPU and GPU?

It did not seem relevant at the time, but I have noticed that if I set my random color and don't use those pointer operations, the color comes out different on the CPU than on the GPU.

I don't know. Could it be some kind of endian problem? But if it were, I guess I would have had problems with the rest of the code, too.
@ri239:

I think that case is a little different: here I subtract an int* from an int*, and the result of that should be a ptrdiff_t. But I guess you are right, though; maybe it should be

ptrdiff_t foo = (*castRes).node - blockStart;
0 Likes

Originally posted by: philips

Do you have any idea if there could be some differences between CPU and GPU?

It did not seem relevant at the time, but I have noticed that if I set my random color and don't use those pointer operations, the color will be different if I run it on the cpu and not on the gpu.

Endian issues are mostly handled automatically by OpenCL. It shouldn't be an issue in code like this.

However, differences in behavior like what you describe here are almost certainly caused by lack of synchronization in the host code. Either in time (i.e. not using clFinish(), synchronous operations, or clEnqueueBarrier() to make sure previous operations have completed before attempting to use their results in subsequent operations) or in memory (not mapping/unmapping a buffer when making changes to a buffer backed by host memory, for example, or simply forgetting to write modified host-side data to the OpenCL buffer).

0 Likes

It seems that I have solved that problem.

 

Maybe you can clarify this: it appears that on the GPU the __global pointers in a buffer always start at 0. 

||||||||||||||||||||||||||||||||||||||  total GPU memory

||||||||||||              ||||||||||        two Buffers

0            s1            0         s2

 

Is that a correct assumption?  

Because of this line: int* pageHeader = (int*)( (int) node & -PageBytes );

it is necessary that the beginning of the buffer in which node lies is aligned to a page, i.e. the last 13 bits of the buffer's start address have to be zero.

 

On the GPU that seems to work.

 

On the CPU, however, OpenCL uses the real addresses, and those do not start at zero, so because of this line the pointer pageHeader was completely wrong.

Now I allocate the memory on the host with a sufficiently aligned pointer and then create the buffer using CL_MEM_USE_HOST_PTR.

So now the whole buffer is aligned and the line leads to the very start of the buffer (or, more precisely, to the start of the respective page node resides in).

 

Thanks, guys. I couldn't have done it without you.

 

0 Likes

Great that you got it working! I'm a bit surprised that it turned out to be an alignment issue after all, given that your assembly dump appeared to have accessed the aligned pointer before dropping dead, but you can't argue with success.

However, now that it's clear how memory was allocated outside the function, there's a bug in your alignment code. The way to align pointers when you have no control over the start address is like this:

byte* p = alloc( needed_size_in_bytes+(required_alignment-1) )

Now, p might not be aligned, but you have room to move it forward by up to "required_alignment-1" bytes and still have "needed_size_in_bytes" available. One of those addresses is necessarily aligned.

 

The way to do this is:

byte* alignedP = (byte*)( ((intptr_t)p + (required_alignment-1)) & -required_alignment );

In your code, you just "and" the buffer start with -PageBytes, and this will move the aligned pointer backwards (before the start of the allocated buffer) if the pointer wasn't aligned to begin with.

Right now you've just hidden the bug CPU-side by doing proper alignment outside your code, and are relying on undocumented allocation behavior in ATI's OpenCL stack.

 

0 Likes

Actually... come to think of it. I think your code is inherently broken. The reason is that you're dealing with two pointers in two different address spaces. You have no guarantee that they both can be aligned with the same operation.

Say, for example, that your CPU pointer was 4096-aligned, but the buffer, when mapped/copied into GPU memory space, happens to be 2048-aligned (because some magical GPU architecture uses pages of size 2048). What would happen is that your GPU code would attempt to align the pointer by adding 2048, even though the actual data prepared by the CPU would begin at offset 0.

Because of that you need to write your GPU code so that it always assumes data starts at offset 0 in the buffers you have allocated. If the GPU code must have some kind of specific alignment(and this sounds like a fairly bad idea when you also have alignment requirements hostside) you need to copy the actual data from offset 0 to the first aligned offset, do whatever work you need to do, and then copy it back to offset 0 so the results will be properly aligned hostside(if the results are stored in the same buffer). This all sounds extremely hairy though. I think I'd try to rewrite the code to remove that alignment requirement. At the very least on the GPU side.

0 Likes

Yeah, one couldn't really see it in the assembly code. The pageHeader needed to be 8192 aligned.

 

 

I do the alignment on the host only for running the program completely on the host (no GPU). So with CL_MEM_USE_HOST_PTR it gets aligned the right way.

 

On the GPU it just works without extra alignment, which is why I thought that maybe all buffers start at 0 and are therefore implicitly aligned to my page size.

You are of course right that it might be wrong to assume that works all the time, as it's not documented how it really works. But doing it right on the GPU is rather complicated. Would be nice if there were a way to allocate the whole buffer at the required alignment.

 

I hope Ati treats the buffers on the gpu the same way nvidia does. So I can just keep the code the way it is. If it doesn't run on an Ati I will have to do the alignment right.

 

However, right now I don't even get it to compile on an Ati GPU (apparently a compiler bug).

 

 

Fortunately my code does not have to be good; it just has to run.

What I am trying to do is just get the CUDA code running on an Nvidia GPU, an Ati GPU and a CPU, and then analyze the speed of the algorithm on the different architectures.

I try to change the CUDA code as little as possible. If I changed that & -PageBytes line I could probably do it without the alignment requirement: if I just saved the base address, the page headers would be at baseAddress + ((nodeAddress - baseAddress) / pageSize) * pageSize.

In the CUDA code they just aligned it that way to find the page headers faster.

 

 

0 Likes

philips,
This might be illegal:
void lookupVoxelColorNormal(float4* colorRes, float4* normalRes, const CastResult* castRes, const CastStack* stack)
{
// Find DXTNode.

__global int* pageHeader = (__global int*)((intptr_t)(*castRes).node & -PageBytes);

You cannot cast from a private address space pointer to a global address space pointer in OpenCL.
0 Likes

Am I doing that? node is a global pointer. I just take the pointer and set the last 13 bits to 0.

 

 

0 Likes

philips,
Ok, I just wanted to bring that to your attention to confirm that you aren't doing something illegal. The definition of CastResult was not in the source you posted to this thread.
0 Likes