Discussion in 'GPGPU Technology & Programming' started by Dade, Feb 25, 2011.
APP 2.4 is part of the Catalyst 11.4 early preview.
It seems the GF104 superscalar design (2-way, 32+16 SP array) underperforms here and only utilizes 224 SPs (2/3 of its 336). Scaling it up: 5092*(480/224)*(825/950) = 9475, which is ~7% off, though maybe that's down to lower bandwidth, smaller caches and error rate. At least GF104 was made for gaming, not compute.
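The scaling estimate above is just arithmetic and can be checked directly (5092 is the reported GF104 score; 224 and 480 are assumed active SP counts, 950 and 825 MHz the clocks):

```python
# The post's scaling estimate: project a full GF100 (480 SPs @ 825 MHz)
# from the GF104 score of 5092, assuming only 224 of GF104's 336 SPs are
# utilized and correcting for the 950 MHz vs 825 MHz clock difference.
gf104_score = 5092
projected = gf104_score * (480 / 224) * (825 / 950)
print(int(projected))  # 9475, ~7% off per the post
```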
BTW, there have been some impressive scores reported recently on the LuxRender forums.
This has been posted by KyungSoo, LuxMark running on 8 x 480GTX:
And this by Royoni, LuxMark running on 4 x HD6990 + 2 x Xeon E5620:
How the hell is he running 8 GTX 460s?
Apart from the fact that he's running GTX480s, not 460s, this way perhaps?
Plus, of course, single-slot watercooling for each one.
Plus, of course, a multidimensional fusion reactor
You only need troll science
He has recently scored a new record with 8x580GTX:
Power supply related problems aside, this should be the current limit (i.e. it should not be possible to use more than 8 cards).
Won't 8 HD 6990s (with single-slot H2O, of course) run together, or is 4 cards / 8 chips their hard limit?
You can use PCI-E ribbon extenders to populate all the slots on the motherboard with double-wide cards. Some custom system enclosure would be required for such setup to mount and hold all the cards steady.
I know there is an 8-device limit (in the BIOS? Where does it come from?), but I don't know whether a HD6990/GTX590 is counted as 1 or 2 devices.
You also have to factor in the power supply problem: Royoni was running LuxMark on 4 x HD6990 and already reported quite some trouble finding a suitable power supply.
You cannot draw more than 2 kW from a normal wall socket (at least where I live), so you would need an industrial power line too.
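A rough estimate backs this up. Assuming ~375 W board power per HD6990 and 80 W TDP per Xeon E5620 (approximate published figures, not numbers from this thread), Royoni's rig lands right around the 2 kW socket limit:

```python
# Rough wall-power estimate for a 4 x HD6990 + 2 x Xeon E5620 rig.
# Assumed figures (approximate, not from the thread): ~375 W board power
# per HD6990, 80 W TDP per E5620, ~150 W for motherboard/RAM/disks.
gpu_w = 4 * 375       # 1500 W
cpu_w = 2 * 80        # 160 W
rest_w = 150
dc_load = gpu_w + cpu_w + rest_w     # 1810 W at the PSU output
wall = dc_load / 0.88                # assuming ~88% PSU efficiency
print(f"~{wall:.0f} W at the wall")  # just over a 2 kW socket limit
```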
pot + kettle ?
Dade, have you profiled LuxMark on various hardware? Just curious to know where the bottlenecks lie on different architectures. Are any specific design decisions hindering or helping performance in a significant way?
Not recently, but I and some other users have in the past. My current concerns about the general validity of LuxMark as a benchmark are:
1) The scene used for the benchmark (i.e. LuxBall HDR) is a bit too simple. The average path length is very small (i.e. less than 2 rays), which leads to not much divergence between GPU threads. A more complex/heavy scene would be more representative of a generic rendering load.
2) Kernel execution time could be too small on high-end GPUs, so kernel launch time/overhead could represent too large a share of the benchmark. For instance, you could probably obtain artificially high scores by optimizing only the kernel dispatch overhead.
3) Memory accesses are extremely scattered. I actually consider this a positive aspect of the benchmark: it represents a more generic workload than some easy-for-GPUs task like matrix multiplication. It may, however, favor architectures like NVIDIA Fermi (i.e. with caches) over AMD GPUs.
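Point 2 can be illustrated with a toy model; all numbers below are made up for illustration, not measured:

```python
# Toy model: a fixed dispatch overhead per kernel launch eats a larger
# share of runtime as kernels get faster.
def effective_rate(samples_per_launch, kernel_time_s, overhead_s):
    """Samples/sec actually achieved, including launch overhead."""
    return samples_per_launch / (kernel_time_s + overhead_s)

work = 100_000        # samples per kernel launch (hypothetical)
overhead = 20e-6      # 20 us dispatch overhead (hypothetical)

slow_gpu = effective_rate(work, 10e-3, overhead)    # 10 ms kernel
fast_gpu = effective_rate(work, 200e-6, overhead)   # 0.2 ms kernel

# The slow GPU loses ~0.2% of its throughput to overhead, the fast one
# ~9%: tuning only the dispatch path would inflate the fast GPU's score.
print(f"slow GPU overhead share: {overhead / (10e-3 + overhead):.1%}")
print(f"fast GPU overhead share: {overhead / (200e-6 + overhead):.1%}")
```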
Past profiling sessions have shown that the ALU utilization (on AMD VLIW GPUs) isn't bad (>60%).
In general, the code is written to run on the Apple, AMD and NVIDIA OpenCL implementations, on both CPU and GPU devices. Like any code written for portability it isn't particularly optimized, but this should show how well an OpenCL implementation runs generic code.
Thanks Dade, interesting info.
What is the other of the two benchmark scenes doing differently? I'm getting vastly different results on that one (it's LuxBall without the HDR): much lower ones, with a far steeper drop on AMD than on Nvidia: HD 6970 drops to 54 (!), GTX 580 to 1035 (!).
What are your scores before the drop?
LuxBall HDR has no direct light sampling (i.e. it is a brute-force path tracer): while tracing the reverse path of light, at each vertex it traces only a single ray to find where the next path vertex is. Eventually the path hits the "background" and receives light from there.
Other scenes have direct light sampling: 2 rays are shot at each path vertex, one to evaluate where the next path vertex is (as before) and another toward area light sources to check if they are visible.
As you can see, LuxBall HDR requires about half the work (rays to trace) to generate a sample compared to the other scenes.
Brute force path tracing is also more SIMD-friendly as it usually leads to less thread divergence (it is what I used in http://www.youtube.com/watch?v=Dh9uWYaiP3s to achieve real-time).
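The ray-count difference Dade describes can be sketched like this (the helper names are hypothetical, not LuxRender code):

```python
# Rays needed per sample: brute-force path tracing shoots one extension
# ray per path vertex; adding direct light sampling also shoots one
# shadow ray toward an area light source at each vertex.
def rays_brute_force(path_vertices):
    return path_vertices            # extension rays only

def rays_with_direct_light(path_vertices):
    return 2 * path_vertices        # extension ray + shadow ray per vertex

avg_len = 2  # LuxBall HDR's average path length is under 2 vertices
print(rays_brute_force(avg_len))        # 2 rays per sample
print(rays_with_direct_light(avg_len))  # 4 rays: about twice the work
```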
Big thanks Dade, that explains a lot (though not quite the GeForce being 20x faster; I'd have thought maybe 2x at most). Of course, Radeon HD 5k+ should be 1.5-2x as fast when brute-force SIMDing something that remotely fits their VLIW.
Well, thread divergence is something that's best avoided, especially on AMD's VLIW architecture, where it leads to very poor utilization. But the difference in performance here is really staggering!