They have the same price. But I want to know which one shows better performance in OpenCL programs.
Dongzaixx,
The HD4850 has 10 SIMDs, which means 800 shader processors.
The HD5670 has 5 SIMDs, which means 400 shader processors.
In most cases the HD4850 performs better than the HD5670, because it has twice as many shaders.
But the HD5670 is designed to support OpenCL completely.
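As a rough illustration of the shader arithmetic above, here is a minimal sketch. The figure of 80 shader processors per SIMD follows from the numbers in this thread (800 / 10 and 400 / 5). The clock speeds (625 MHz for the HD4850, 775 MHz for the HD5670) are assumed reference clocks that are not stated in this thread, and the estimate assumes one multiply-add (2 flops) per shader processor per cycle, so treat the result as a theoretical upper bound rather than a benchmark.

```python
# Shader processors per SIMD, implied by the thread: 800 / 10 and 400 / 5.
SP_PER_SIMD = 80


def shader_processors(simds):
    """Total shader processors for a given number of SIMDs."""
    return simds * SP_PER_SIMD


def peak_gflops(simds, clock_mhz):
    """Theoretical single-precision peak, assuming 2 flops per SP per cycle."""
    return shader_processors(simds) * 2 * clock_mhz / 1000.0


if __name__ == "__main__":
    # Clock speeds are assumptions (reference clocks), not from the thread.
    for name, simds, clock_mhz in (("HD4850", 10, 625), ("HD5670", 5, 775)):
        print(name, shader_processors(simds), "SPs,",
              peak_gflops(simds, clock_mhz), "GFlops peak (theoretical)")
```

Real OpenCL kernels rarely approach this peak, so the ratio between the two cards matters more than the absolute numbers.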
The HD4XXX series is not completely designed for OpenCL, which is why I ask this question. Do you have any benchmarks?
Unless your OpenCL program uses local memory, the 4850 will always give you better performance.
You can test the MatrixMultiplication sample in SDK 2.01: it runs without local memory on the 7xx series and uses local memory on the 8xx series. Use the command line options -x 256 -y 256 -z 256
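To show why local memory matters for a matrix multiply, here is a minimal sketch in plain Python. It is not the SDK sample's kernel; it only counts how many values each approach would read from global memory. The names naive_matmul and tiled_matmul and the tile size are hypothetical, and the per-tile lists stand in for OpenCL local memory.

```python
def naive_matmul(a, b, n):
    """Multiply two n x n matrices, reading every operand from 'global' memory."""
    reads = 0
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
                reads += 2  # one element of a, one element of b
            c[i][j] = acc
    return c, reads


def tiled_matmul(a, b, n, t):
    """Same product, but each t x t tile is loaded once into a 'local' buffer."""
    assert n % t == 0, "tile size must divide the matrix size"
    reads = 0
    c = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, t):
        for bj in range(0, n, t):
            for bk in range(0, n, t):
                # Load one tile of a and one tile of b from 'global' memory.
                ta = [[a[bi + i][bk + k] for k in range(t)] for i in range(t)]
                tb = [[b[bk + k][bj + j] for j in range(t)] for k in range(t)]
                reads += 2 * t * t
                # All arithmetic below touches only the local tiles.
                for i in range(t):
                    for j in range(t):
                        acc = 0.0
                        for k in range(t):
                            acc += ta[i][k] * tb[k][j]
                        c[bi + i][bj + j] += acc
    return c, reads


if __name__ == "__main__":
    n, t = 8, 4
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    c1, r1 = naive_matmul(a, b, n)
    c2, r2 = tiled_matmul(a, b, n, t)
    print("same result:", c1 == c2)
    print("global reads, naive:", r1, "tiled:", r2)
```

With a tile size of t, the tiled version reads each operand from global memory t times less often. That saving only pays off on hardware where local memory is actually faster than global memory.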
Just out of curiosity, I tried this test on an HD5870 and on a CPU (dual-core Intel) to see how much I would gain from using the GPU instead of the CPU, and I see no gain at all. Here are the commands and their output:
% MatrixMultiplication -device cpu -x 256 -y 256 -z 256 -i 16
Executing kernel for 16 iterations
-------------------------------------------
KernelTime (ms) : 0.535729
GFlops achieved : 62.6332
KernelTime (ms) : 0.532304
GFlops achieved : 63.0362
KernelTime (ms) : 0.816905
GFlops achieved : 41.0751
KernelTime (ms) : 0.831021
GFlops achieved : 40.3774
..
% MatrixMultiplication -device gpu -x 256 -y 256 -z 256 -i 16
Executing kernel for 16 iterations
-------------------------------------------
KernelTime (ms) : 0.53732
GFlops achieved : 62.4478
KernelTime (ms) : 0.529792
GFlops achieved : 63.3351
KernelTime (ms) : 1.85189
GFlops achieved : 18.119
KernelTime (ms) : 0.831176
GFlops achieved : 40.3698
..
Is this what I should expect?
Gapon,
Please run with -x 2048 -y 2048 -z 2048
With smaller dimensions, transfer time dominates the kernel time, which is why you see such poor performance.
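A small sketch of the arithmetic behind this advice, under stated assumptions. The formula 2 * m * n * k flops per product reproduces the GFlops figures printed above (for example, 0.535729 ms at 256 x 256 x 256 gives about 62.6 GFlops). The function names, the mapping of -x/-y/-z onto m, n, k, and the 4-byte float element size are assumptions for illustration, not taken from the SDK sample.

```python
def matmul_flops(m, n, k):
    """Flops for an (m x k) by (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k


def gflops(m, n, k, time_ms):
    """Achieved GFlops for a product that took time_ms milliseconds."""
    return matmul_flops(m, n, k) / (time_ms * 1e-3) / 1e9


def transfer_bytes(m, n, k, bytes_per_element=4):
    """Bytes moved for two input matrices and one output (assumes 4-byte floats)."""
    return bytes_per_element * (m * k + k * n + m * n)


def flops_per_byte(m, n, k):
    """Arithmetic work available per byte transferred."""
    return matmul_flops(m, n, k) / transfer_bytes(m, n, k)


if __name__ == "__main__":
    print("GFlops at 0.535729 ms:", gflops(256, 256, 256, 0.535729))
    for size in (256, 2048):
        print(size, "flops:", matmul_flops(size, size, size),
              "bytes:", transfer_bytes(size, size, size),
              "flops/byte:", flops_per_byte(size, size, size))
```

Going from 256 to 2048 multiplies the arithmetic by 512 but the data moved by only 64, so fixed overheads and transfers take up a much smaller share of the run at the larger size.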