Sisoft GPGPU benches


Florin
28-Nov-2009, 14:52
So Sisoft Sandra 2010 Lite (http://www.sisoftware.net/index.html?dir=dload&location=sware_dl_3264&langx=en&a=) is out and it now includes a pretty comprehensive set of GPGPU benchmarks, including OpenCL, DirectCompute, Nvidia CUDA and ATI Stream components for measuring both processing and memory bandwidth. Might be interesting to see some results.

However, I'm running into some issues with the CUDA and DirectCompute benchmarks. Both of these consistently generate the 'Display driver nvlddmkm stopped responding and has successfully recovered.' error. This happens both on my Windows 7 i7 860 system with a GTX280, and on a Vista Athlon X2 4200+ machine with an 8600 GTS (this last system doesn't offer the DC benchmark btw, even though the platform update and DX11 redist are installed). Both are running WHQL driver 195.62, but the same happens with other drivers. Anyone on Nvidia able to run these 2 benchmarks successfully?

trinibwoy
28-Nov-2009, 16:55
Thanks for the heads up. My single precision and internal bandwidth numbers for a GTX 285 @ 702/1584/1350:

CUDA - 461.5 MP/s, 132.5 GB/s
OCL - 462.5 MP/s, 132.1 GB/s
CS - 534.6 MP/s, 129.4 GB/s

Davros
28-Nov-2009, 19:19
why would bandwidth differ depending on API?
it suggests it's not being measured properly

Florin
28-Nov-2009, 19:30
I imagine that multiple runs of each type of test would show minor differences as well. The results are so close that the spread might not be statistically significant.
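
As a rough check on how close the posted bandwidth figures are, the relative spread can be worked out from the three numbers above. This is only a sketch with illustrative variable names, and since there is one figure per API it says nothing about run-to-run variance, which would need repeated runs.

```python
# Internal bandwidth figures (GB/s) posted above for the GTX 285.
bandwidth_gbs = {"CUDA": 132.5, "OCL": 132.1, "CS": 129.4}

values = list(bandwidth_gbs.values())
mean = sum(values) / len(values)
# Relative spread: gap between the highest and lowest figure, as a share of the mean.
spread_pct = (max(values) - min(values)) / mean * 100

print(f"mean = {mean:.1f} GB/s, spread = {spread_pct:.1f}% of mean")
```

The gap between the highest and lowest API comes out at a little over 2% of the mean.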

Tim Murray
28-Nov-2009, 19:52
I'm more than a little confused as to how switching to CS gets you a 15-20% performance improvement.
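
For reference, the size of the CS advantage can be computed from the single-precision numbers posted above. A minimal sketch; `gain_pct` is just a helper name chosen here.

```python
# Single-precision throughput (MP/s) posted above for the GTX 285.
throughput_mps = {"CUDA": 461.5, "OCL": 462.5, "CS": 534.6}

def gain_pct(new, base):
    """Percentage by which `new` exceeds `base`."""
    return (new / base - 1) * 100

cs_vs_cuda = gain_pct(throughput_mps["CS"], throughput_mps["CUDA"])
cs_vs_ocl = gain_pct(throughput_mps["CS"], throughput_mps["OCL"])
print(f"CS vs CUDA: +{cs_vs_cuda:.1f}%, CS vs OCL: +{cs_vs_ocl:.1f}%")
```

On these figures CS is roughly 16% ahead of both CUDA and OCL.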

trinibwoy
28-Nov-2009, 20:58
No idea, but the results are repeatable. Did notice one thing though: during the compute test my system goes completely unresponsive (no mouse) when running the CUDA version. That doesn't happen when running OCL or CS.

Tim Murray
28-Nov-2009, 21:07
How long does the CUDA run last? (I am a lazy dude so I haven't installed it myself)

trinibwoy
28-Nov-2009, 21:25
cuda: 2:15
ocl: 0:37
cs: 1:01
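
Converted to seconds, the gap between those run times is easier to compare. A small sketch; the `to_seconds` helper is made up for illustration and assumes the times are in m:ss form.

```python
def to_seconds(mm_ss):
    """Convert an 'm:ss' string to whole seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)

# Run times as posted above.
run_times = {"cuda": "2:15", "ocl": "0:37", "cs": "1:01"}
seconds = {api: to_seconds(t) for api, t in run_times.items()}
fastest = min(seconds.values())

for api, secs in seconds.items():
    print(f"{api}: {secs} s ({secs / fastest:.2f}x the shortest run)")
```

The CUDA run takes more than three and a half times as long as the OCL run, even though the reported scores are nearly identical.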

Tim Murray
28-Nov-2009, 22:53
somehow I do not think these tests are exactly equivalent!

trinibwoy
29-Nov-2009, 03:52
Downclocked shaders to 1400 MHz and got the following:

OCL - 376 MP/s
CUDA - 409 MP/s
CS - 460 MP/s

So now there's separation between OCL and CUDA as well. Weird.

Tim Murray
29-Nov-2009, 05:48
so clocks dropped ~12% but your OCL performance dropped 20%? uhh, this really is a weird bench, I guess.
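
The percentages can be checked against the posted numbers. This sketch assumes that 1584 in the 702/1584/1350 figures is the original shader clock in MHz; `drop_pct` is a helper name chosen here.

```python
# Shader clock before and after the downclock (MHz). Treating 1584 as the
# shader clock from the 702/1584/1350 figures is an assumption.
clock_before, clock_after = 1584, 1400

# Throughput (MP/s) before and after the downclock, as posted in the thread.
results = {"OCL": (462.5, 376), "CUDA": (461.5, 409), "CS": (534.6, 460)}

def drop_pct(before, after):
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100

clock_drop = drop_pct(clock_before, clock_after)
perf_drop = {api: drop_pct(b, a) for api, (b, a) in results.items()}

print(f"shader clock: -{clock_drop:.1f}%")
for api, pct in perf_drop.items():
    print(f"{api}: -{pct:.1f}%")
```

On these numbers only the CUDA result tracks the clock change closely; OCL and CS both fall by more than the clock did.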

CarstenS
29-Nov-2009, 11:02
Seconded, my GTX280 clocked at 602/1296/1107 (ref) scores lower than a 9800 GTX+, which is in Sandra's reference list of devices.

Forrest
29-Nov-2009, 21:03
I'm more than a little confused as to how switching to CS gets you a 15-20% performance improvement.

CS is where most of their market is so it should be the most optimized one right?

Psycho
30-Nov-2009, 00:22
More difference for ati:
5770@stock:
stream: 482 MPix/s
cs: 1068 MPix/s

bandwidth - stream & cs: 50 GB/s

Tim Murray
30-Nov-2009, 06:42
CS is where most of their market is so it should be the most optimized one right?
without seeing the kernels, it's hard to tell, but I imagine they're not perfectly equivalent. can't think of anything off the top of my head that would account for the differences beyond compiler changes, and the CUDA compiler should be the most mature of the three (plus Carsten's comments are odd).

NV guys, what drivers are you running?

Andrew Lauritzen
30-Nov-2009, 22:34
without seeing the kernels, it's hard to tell, but I imagine they're not perfectly equivalent. can't think of anything off the top of my head that would account for the differences beyond compiler changes, and the CUDA compiler should be the most mature of the three (plus Carsten's comments are odd).
Microsoft's compiler is involved with the DC path obviously... it's possible given a few differences in the languages that it can be more aggressive at optimization than CUDA. I'd have to go look into the details of CUDA again to refresh my memory, but just throwing ideas out there.

The other possibility is that DC can interact more directly with the WDDM. Certainly this is a huge win for "mixed-mode" applications, although it shouldn't be too important for synthetic tests like this, and particularly ones that aren't doing anything much on the CPU...

The OCL vs. CUDA differences are probably just an immature OCL implementation.

I'd still agree and put my money on these converging though as drivers mature.

trinibwoy
30-Nov-2009, 22:48
NV guys, what drivers are you running?

195.55 betas, win7 64-bit.

Andy, what would Microsoft's DC compiler output? Is it something similar to IL/PTX? And if so what do the IHVs do with that?

OpenGL guy
30-Nov-2009, 23:23
195.55 betas, win7 64-bit.

Andy, what would Microsoft's DC compiler output? Is it something similar to IL/PTX? And if so what do the IHVs do with that?
MS's DC compiler is the same HLSL compiler that's been around for years. It outputs DX shader tokens and those get passed to the driver, just like for any other shader.

Tim Murray
01-Dec-2009, 05:15
The other possibility is that DC can interact more directly with the WDDM. Certainly this is a huge win for "mixed-mode" applications, although it shouldn't be too important for synthetic tests like this, and particularly ones that aren't doing anything much on the CPU...
If you're streaming data to the GPU, running a kernel, and streaming it back, WDDM shouldn't have any impact. Only thing I can think of for the 57xx cards is that DC gets to use append buffers or something like that for a major performance improvement by avoiding CPU roundtrips (which can be a killer on WDDM), but that would make for a weird benchmark.

CUDA does support the restrict keyword (as __restrict__), so by using it you should be able to get some optimization benefits that you wouldn't otherwise get because of pointer aliasing.

trinibwoy
01-Dec-2009, 12:48
MS's DC compiler is the same HLSL compiler that's been around for years. It outputs DX shader tokens and those get passed to the driver, just like for any other shader.

Interesting, thanks.

Andrew Lauritzen
01-Dec-2009, 19:05
If you're streaming data to the GPU, running a kernel, and streaming it back, WDDM shouldn't have any impact.
Not sure this is true... this definitely all goes through the WDDM in DC, and perhaps in CUDA too (depending on whether they tunnel it through there or try to abstract their own little space). Typically the WDDM/OS owns all GPU memory and command buffer submission on Windows though, so normally anything that goes to the GPU goes through there in some manner.

Now obviously an IHV can go around that (rather than through it) to some extent with things like CUDA, but potentially with some pitfalls.

Silent_Buddha
01-Dec-2009, 23:41
How much direct access does MS allow to hardware? Wasn't WDDM a move towards removing direct access in order to reduce the chances of a buggy driver blue-screening the OS?

Or a problem piece of hardware doing the same?

Regards,
SB

OpenGL guy
02-Dec-2009, 00:40
How much direct access does MS allow to hardware? Wasn't WDDM a move towards removing direct access in order to reduce the chances of a buggy driver blue-screening the OS?

Or a problem piece of hardware doing the same?
Since DirectCompute runs within DirectX (and even uses the D3D driver), buffer sharing is trivial. Sharing buffers between other APIs (such as between OpenCL and OpenGL) may be more problematic.

Tim Murray
02-Dec-2009, 07:47
Not sure this is true... this definitely all goes through the WDDM in DC, and perhaps in CUDA too (depending on whether they tunnel it through there or try to abstract their own little space). Typically the WDDM/OS owns all GPU memory and command buffer submission on Windows though, so normally anything that goes to the GPU goes through there in some manner.

Now obviously an IHV can go around that (rather than through it) to some extent with things like CUDA, but potentially with some pitfalls.
Let me rephrase--the additional overhead imposed by WDDM versus other driver models (e.g., the cost of accessing hardware in Linux) should cause no meaningful performance penalties in the basic streaming case. What I was wondering in particular is if DC could be faster because of some hooks into WDDM, for instance (to avoid some additional user/kernel transitions), but I don't know enough about DC to be able to say either way.

CarstenS
02-Dec-2009, 08:23
Do CS kernels count as being evil in terms of the concurrent kernel execution stuff or do they behave nicely in NV-GPUs and do not block the chip for their duration?

Andrew Lauritzen
03-Dec-2009, 00:44
Let me rephrase--the additional overhead imposed by WDDM versus other driver models (e.g., the cost of accessing hardware in Linux) should cause no meaningful performance penalties in the basic streaming case.
Agreed, although it all depends on how - for instance - the CUDA and OpenCL runtimes are implemented with respect to WDDM. There's always a possibility of some inefficiencies there, although again I'd assume that with simple streaming kernels we're going to be far from CPU-limited.

Do CS kernels count as being evil in terms of the concurrent kernel execution stuff or do they behave nicely in NV-GPUs and do not block the chip for their duration?
I'm assuming this is a limitation of current NVIDIA hardware (specifically since it is called out as a benefit of Fermi) rather than anything in software. Thus I expect DC on current NVIDIA hardware to behave like the other APIs.

neliz
03-Dec-2009, 22:05
SiSoft's own benchmark numbers:

SP/DP: http://sisoftware.co.uk/index.html?dir=qa&location=cpu_vs_gpu_proc&langx=en&a=

Bandwidth: http://sisoftware.co.uk/index.html?dir=qa&location=cpu_vs_gpu_mem&langx=en&a=

Tim Murray
03-Dec-2009, 22:12
Their PCIe numbers are totally crazy; getting 5+ GB/s on NV hardware (and AMD hardware, for what it's worth) is trivial. Maybe they're using paged memory?

edit: yeah, I get 5.8/5.2 GB/s on my i7 system for pinned memory and 3 GB/s bidirectional for paged, but PCIe transfers from paged memory are more of an issue of CPU memcpy bandwidth than PCIe performance (which is why they're so radically higher on Nehalem than previous Intel CPUs).
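
One way to see why CPU memcpy bandwidth would matter for paged transfers is a simple serial-staging model. This is an assumed simplification, not a description of any actual driver: each byte is first copied by the CPU into a staging buffer and then sent over PCIe, with no overlap between the two steps. The memcpy bandwidths below are hypothetical; only the 5.8 GB/s pinned figure comes from the post above.

```python
def staged_bandwidth(memcpy_gbs, pcie_gbs):
    """Effective GB/s when every byte is copied by the CPU into a staging
    buffer and then sent over PCIe, with no overlap (assumed model)."""
    return 1.0 / (1.0 / memcpy_gbs + 1.0 / pcie_gbs)

pcie_gbs = 5.8  # pinned-memory figure quoted above
for memcpy_gbs in (4.0, 10.0):  # hypothetical CPU memcpy bandwidths
    eff = staged_bandwidth(memcpy_gbs, pcie_gbs)
    print(f"memcpy {memcpy_gbs:.0f} GB/s -> paged transfer ~{eff:.2f} GB/s")
```

Under this model the paged figure always sits below both the memcpy and the PCIe rate, and it rises as the CPU copy gets faster, which would be consistent with the Nehalem remark. A driver that pipelines the staging copies would do better than the model predicts.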

Forrest
04-Dec-2009, 00:33
Their PCIe numbers are totally crazy; getting 5+GB/s on NV hardware (and AMD hardware, for what it's worth) is trivial. Maybe they're using paged memory?

edit: yeah, I get 5.8/5.2 GB/s on my i7 system for pinned memory and 3GB/s bidirectional for paged, but PCIe transfers from paged memory are more of an issue of CPU memcpy bandwidth than PCIe performance (which is why they're so radically higher on Nehalem than previous Intel CPUs).

Totally crazy PCIe numbers? sure! :lol:

I get 0.4 GB/s in OpenCL on 4850.

And this one :

http://img707.imageshack.us/img707/5830/opencl.jpg

1/8th memory transfer performance. :???:

neliz
04-Dec-2009, 07:13
I didn't even take the time to check the graphs properly.. but what about the CUDA/OCL variance between two essentially identical platforms, the 9400M and ION?

Florin
05-Dec-2009, 14:24
However I'm running into some issues with the CUDA and DirectCompute benchmarks. Both of these consistently generate the 'Display driver nvlddmkm stopped responding and has successfully recovered.' error.

Removing and reinstalling the Nvidia driver after installing Sandra fixed this. Not sure what's going on but fwiw.

Arnold Beckenbauer
09-Dec-2009, 20:20
There is a new version of SiSoft Sandra 2010: 1611.
How can I get OpenCL working with my HD4850? OpenCL&CPU work.

CarstenS
15-Dec-2009, 16:09
I don't get it to work with HD 5870 either. Freshly installed Windows 7, Cat 9.11 WHQL, Sandra 1611, Stream SDK 2.0 b4 - Stream and DirectCompute work, but I don't get anything OpenCL.

Any hints?

neliz
15-Dec-2009, 18:20
Any hints?

Wait for tomorrow's Cats and pray Terry did his job?

CarstenS
16-Dec-2009, 08:51
Hopefully - thx for the exceptionally helpful hint btw ;-)

edit:
Never mind - seems like a royal FU with Sandra 2010-1611 or with me. Everything else OpenCL works.

neliz
16-Dec-2009, 10:41
Hopefully - thx for the exceptionally helpful hint btw ;-)

A hint would be something about a GTS330, GTS350, GTX360/380 and HD5830 and 5890.. I'm not giving those

CarstenS
16-Dec-2009, 11:32
I meant that rather sarcastically, given that everyone in the world and their dogs follow Terry on Twitter. ;)

neliz
16-Dec-2009, 11:35
I meant that rather sarcastically, given that everyone in the world and their dogs follow Terry on Twitter. ;)

He never said it would support OpenGL 3.2 though..;) (but that's not the subject at hand)

Arnold Beckenbauer
19-Dec-2009, 00:39
http://developer.amd.com/support/KnowledgeBase/Lists/KnowledgeBase/DispForm.aspx?ID=72

KB72 - Running SiSoftware Sandra 2010 OpenCL™ GPGPU benchmarks with the ATI Stream SDK v2.0-beta4
It works.