Again about TressFX in OpenGL: the full OpenGL version (i.e. without OpenCL) works and displays output similar to the DX11 SDK sample.
However, I get poor performance in the first pass (which fills the per-pixel linked list): it takes around 5 ms to complete on my R9 290X, while the DX11 sample takes around 1 ms.
On the other hand, the simulation and fullscreen passes take around 3 ms to execute, which is mostly in line with the performance of the DX11 sample.
Using a debug color value, I checked that my sample sends a similar number of per-pixel list entries in DX11 and OpenGL.
(Please note that these numbers were measured at close range and without the head mesh, which would otherwise reduce the pixel draw count; they are worst-case figures for the overall technique.)
The code is here :
The first pass basically takes less than 0.1 s to execute if I do not call StoreFragments_Hair, so I'm guessing my perf issues are tied to how I'm doing the per-pixel list recording.
I have no idea why my code is that much slower than the DX11 one. I mapped all DX11 calls to their GL 4.3 counterparts, i.e. an SSBO for the structured buffer and an image for the RWImage.
I also use an atomic counter buffer for the pixel id increment.
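For anyone not familiar with the pattern, here is roughly what the recording step boils down to, written as a CPU-side C++ sketch. This is not the actual StoreFragments_Hair shader code: the names (PixelLists, FragmentNode, storeFragment, kEndOfList) and the node layout are made up for illustration, and std::atomic only stands in for the atomic counter and the image atomic exchange on the GPU.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical end-of-list marker stored in the head-pointer image.
constexpr std::uint32_t kEndOfList = 0xFFFFFFFFu;

// Assumed node layout; stands in for one element of the SSBO / structured buffer.
struct FragmentNode {
    std::uint32_t color;  // packed color (e.g. the debug color)
    float         depth;
    std::uint32_t next;   // index of the next node, or kEndOfList
};

struct PixelLists {
    std::vector<std::atomic<std::uint32_t>> heads;  // stands in for the head-pointer image
    std::vector<FragmentNode> nodes;                // stands in for the SSBO
    std::atomic<std::uint32_t> counter{0};          // stands in for the atomic counter

    PixelLists(std::size_t pixelCount, std::size_t maxNodes)
        : heads(pixelCount), nodes(maxNodes) {
        for (auto& h : heads) h.store(kEndOfList);  // clear pass: every list starts empty
    }

    // Records one fragment; returns false when the node pool is exhausted.
    bool storeFragment(std::size_t pixel, std::uint32_t color, float depth) {
        std::uint32_t idx = counter.fetch_add(1);          // allocate a node slot
        if (idx >= nodes.size()) return false;             // pool full, drop the fragment
        std::uint32_t prev = heads[pixel].exchange(idx);   // new node becomes the list head
        nodes[idx] = {color, depth, prev};                 // link to the previous head
        return true;
    }

    // Walks the list of one pixel and counts its nodes.
    std::size_t listLength(std::size_t pixel) const {
        std::size_t n = 0;
        for (std::uint32_t i = heads[pixel].load(); i != kEndOfList; i = nodes[i].next) ++n;
        return n;
    }
};
```

Each fragment costs one counter increment, one atomic exchange on the head pointer, and one node write, so these three operations are where I would expect the GL and DX11 versions to differ.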
I monitored my app with GPU Perf Studio 2, but unfortunately it doesn't give any useful information, as I didn't find any counters related to images, SSBOs or atomics.
It only tells me that the PAStalledOnRasteriser percentage is high (96%), which is expected since I'm sending a lot of primitives.