Archives Discussions

polarnick · ‎10-13-2017

Hi!

My problem is the same as in the "Slow SPIR" thread.

But I think my problem is VGPR usage: under SPIR, far more VGPRs are used. How can I investigate this and work around it? I can provide one executable built with OpenCL and one built with SPIR. I also wonder whether SPIR being twice as slow has really gone uninvestigated for so many years, because I don't think pure OpenCL without SPIR can be used in commercial apps.

dipak · ‎10-16-2017

Hi Nikolay,

Thanks for reporting it. Please share both versions of the binary so our team can investigate the reason for the performance drop. Also, please mention your setup details.

Regards,

polarnick · ‎10-16-2017

Thanks for the response! I sent the reproducers to you via PM.

More context:

I have an algorithm that calculates the correlation between two cameras, so the kernel uses two functions: project(3D point) -> pixel on photo and unproject(pixel on photo) -> 3D ray. These functions depend on the camera type, which can be pinhole, fisheye, etc. The camera type is constant per kernel launch. The functions look like this:

int camera_type = passed via kernel arg;

if (camera_type == PINHOLE_CAMERA_TYPE) {
    some math...
} else if (camera_type == FISHEYE_CAMERA_TYPE) {
    some math...
} else if (...) {
    ...
}

I found that performance is very low (220 seconds). I profiled the kernels under CodeXL and found very low occupancy due to high VGPR usage. My setup: Ubuntu 16.04 + amdgpu-pro 16.40-348864 + FirePro W9100.
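To illustrate why high VGPR usage lowers occupancy, here is a rough sketch of the usual GCN rule of thumb. The limits in it (256 VGPRs per lane, at most 10 waves per SIMD, allocation in steps of 4 registers) are assumptions based on how GCN hardware is commonly described, not values taken from this thread, and the function name is hypothetical. It looks at VGPRs only and ignores SGPR and LDS limits, which can reduce occupancy further.

```c
#include <assert.h>

/* Assumed GCN limits (not stated in the thread): each SIMD offers 256 VGPRs
 * per lane, runs at most 10 wavefronts, and allocates VGPRs in steps of 4. */
enum {
    GCN_VGPRS_PER_LANE = 256,
    GCN_MAX_WAVES_PER_SIMD = 10,
    GCN_VGPR_GRANULE = 4
};

/* Hypothetical helper: how many wavefronts fit on one SIMD when every work
 * item needs `vgprs` vector registers. Only the VGPR budget is considered. */
int gcn_waves_per_simd(int vgprs)
{
    assert(vgprs > 0 && vgprs <= GCN_VGPRS_PER_LANE);

    /* Round the request up to the allocation granule. */
    int allocated = (vgprs + GCN_VGPR_GRANULE - 1) / GCN_VGPR_GRANULE
                    * GCN_VGPR_GRANULE;

    /* Each resident wave takes `allocated` registers out of the per-lane budget. */
    int waves = GCN_VGPRS_PER_LANE / allocated;

    return waves > GCN_MAX_WAVES_PER_SIMD ? GCN_MAX_WAVES_PER_SIMD : waves;
}
```

Under these assumptions, a kernel that needs 24 VGPRs or fewer can reach the full 10 waves per SIMD, while one that needs 128 VGPRs is limited to 2, so a compiler that allocates more registers for the same source directly reduces how much latency the GPU can hide.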

I hardcoded the camera type in the kernel to a fixed value (the correct one for the test dataset) by changing "int camera_type = passed via kernel arg;" to "int camera_type = PINHOLE_CAMERA_TYPE;", so the compiler can easily drop the other if-else branches. This gave a huge speedup, from 220 seconds to 129 seconds. By the way, the NVIDIA compiler shows no speedup after this change (nor when the other branches are commented out), so it seems to allocate variables to registers more efficiently; the processing time and the profiler output stay the same.

This looks like a win, because camera_type can be made a compile-time constant and everything will be OK.
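One possible way to make camera_type a compile-time constant without editing the kernel source per camera is to pass it as a preprocessor define in the OpenCL build options. The sketch below only builds the options string and shows a placeholder kernel. The names CAMERA_TYPE, kernel_source and make_build_options, the numeric values of the camera type constants, and the projection math are all assumptions made for illustration; they are not the real code from my application.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical numeric values; the real constants are not shown in the thread. */
enum { PINHOLE_CAMERA_TYPE = 0, FISHEYE_CAMERA_TYPE = 1 };

/* Placeholder OpenCL C source: the camera type is chosen by the preprocessor,
 * so the compiler never sees the branches for the other camera types. */
const char *kernel_source =
    "#define PINHOLE_CAMERA_TYPE 0\n"
    "#define FISHEYE_CAMERA_TYPE 1\n"
    "float2 project(float3 p) {\n"
    "#if CAMERA_TYPE == PINHOLE_CAMERA_TYPE\n"
    "    return p.xy / p.z;                 /* placeholder math */\n"
    "#elif CAMERA_TYPE == FISHEYE_CAMERA_TYPE\n"
    "    float r = length(p.xy);            /* placeholder math */\n"
    "    return p.xy * (atan2(r, p.z) / r);\n"
    "#else\n"
    "#error unsupported CAMERA_TYPE\n"
    "#endif\n"
    "}\n";

/* Writes "-DCAMERA_TYPE=<n>" into buf. Returns the string length on success
 * or -1 if the buffer is too small. The result is meant to be passed as the
 * `options` argument of clBuildProgram(). */
int make_build_options(char *buf, size_t size, int camera_type)
{
    assert(buf != NULL && size > 0);

    int n = snprintf(buf, size, "-DCAMERA_TYPE=%d", camera_type);
    if (n < 0 || (size_t)n >= size) {
        return -1;
    }
    return n;
}
```

The program would then be built once per camera type that actually occurs in a dataset, for example by calling make_build_options(opts, sizeof opts, PINHOLE_CAMERA_TYPE) and handing opts to clBuildProgram. The cost is one compiled program per camera type (or per combination of types) in use.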

But the same code compiled to SPIR (with a fixed camera_type) shows much worse performance: 365 seconds.

So my question is: why is the OpenCL version with a compile-time fixed camera type (binary #1, 129 seconds) so much faster than the SPIR version of the same code with a fixed camera type (binary #2, 365 seconds)? And is it perhaps possible to get the same performance with the camera type passed at runtime (binary #3, 220 seconds), without inflating the kernels by precompiling them for all combinations of camera types?

dipak · ‎10-17-2017

Thanks. I'll report it to the concerned team.

Regards,

dipak · ‎10-24-2017

Update:

It seems that the issue is also reproducible with the latest internal builds. For further investigation, a ticket has been opened against the issue and assigned to the appropriate team.

Regards,

dipak · ‎11-09-2017

Hi Nikolay,

I'm very sorry to say that there will be no fix for this issue at this moment. In fact, as I've come to know, support for SPIR on amdgpu-pro is very limited currently.

Regards,

OpenCL is ~twice faster than SPIR version