nVidia's Island DirectX 11 Demo runs slowly on AMD GPUs

Our Radeon HD 5870, equipped with 2GB of GDDR5 video memory, was only able to achieve single-digit framerates while running extensive DirectCompute physics calculations. Given that GF100 chips have vastly more L1 and L2 cache than ATI's, running the demo on ATI hardware results in constant spilling into video memory, causing a large difference in framerate. Given that nVidia didn't design this demo with AMD hardware in mind, it is good to see it working in any shape or form. However, we do take notice that AMD's demos ran flawlessly, while this demo played in the sub-10 fps range.
http://www.brightsideofnews.com/news/2010/3/30/nvidias-island-directx-11-demo-works-on-amd-gpus.aspx

Theo's analysis seems to be wrong.

If you program to DX11 and DirectCompute, isn't that vendor neutral?
Better hardware would render faster. A 5870 should be faster than a 5830.

So if Fermi renders faster, shouldn't that be because Fermi has better hardware for DirectCompute and tessellation?
 
True, but it could be an issue of developer productivity. It could be that Fermi's caches and architecture permit simpler, more straightforward code, or 'sloppier', less-frugal code. I'm more productive if I can take a traditional algorithm and run it on an out-of-order CPU with a traditional cache than on an in-order CPU with a manually managed local store, for example.

And sometimes, it might even be an issue where "tuning" doesn't work, and you need a separate path. Pre-G8x Nvidia GPUs had terrible dynamic branching behavior, so no amount of tweaking inputs or registers would solve the problem.

Of course, it could also be a matter of not leveraging what AMD is good at, and coding for their memory architecture and atomics. That's not really tuning tho, as it forces the developer to maintain two separate paths, and there's still the possibility that even with custom paths, Fermi is just better suited to physics workloads.

I'll note that I don't like the bifurcation that's happening because of the different memory/cache architectures of Fermi vs Cypress, but I can't criticize NVidia's decision to amp up the caches, since it appears well within their rights to implement the DX Compute/OpenCL spec using an architecture like this. The divergence is regrettable as it will add pain for developers.
 
True, but it could be an issue of developer productivity. It could be that Fermi's caches and architecture permit simpler, more straightforward code, or 'sloppier', less-frugal code. I'm more productive if I can take a traditional algorithm and run it on an out-of-order CPU with a traditional cache than on an in-order CPU with a manually managed local store, for example.
Or it could be something as simple as vectorizing your code when possible. You wouldn't use x87 when SSE was appropriate on a CPU.
DemoCoder said:
And sometimes, it might even be an issue where "tuning" doesn't work, and you need a separate path. Pre-G8x Nvidia GPUs had terrible dynamic branching behavior, so no amount of tweaking inputs or registers would solve the problem.

Of course, it could also be a matter of not leveraging what AMD is good at, and coding for their memory architecture and atomics. That's not really tuning tho, as it forces the developer to maintain two separate paths, and there's still the possibility that even with custom paths, Fermi is just better suited to physics workloads.
The OP was referring to DX11 and DirectCompute. Plenty of ways to take these vendor-neutral APIs and create workloads that favor one architecture over another, particularly with DirectCompute.
 
Maybe GF100 has the better Tessellation implementation - and the water demo is nVidia's Tessellation showcase.
 
Isn't the whole point of tessellation to reduce bandwidth by creating the extra details on chip? What's being spilled out to memory? Not the extra triangles...
 
Or it could be something as simple as vectorizing your code when possible. You wouldn't use x87 when SSE was appropriate on a CPU.

Depends on the situation. In the case of something like the water simulation, I'd be inclined to agree with you, but as you know, many compilers and VMs offer auto-vectorization features, and C programmers don't automatically write code to leverage SSE over x87 in every situation; I see lots of math-related code written in a scalar fashion. My point is, if Fermi has better performance on non-vectorized code, and if that non-vectorized code performs on par with hand-vectorized code, then Fermi presents a net win for the developer, since he can obtain equal performance for less development effort.
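
Just to illustrate the contrast I mean (a made-up C++ toy, nothing to do with the demo itself): the same loop written as plain scalar code, which the compiler may or may not auto-vectorize, and hand-vectorized with SSE intrinsics.

#include <xmmintrin.h>

// Scalar version: easy to write, correctness is obvious, vectorization is left to the compiler.
void add_scalar(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Hand-vectorized version: 4 floats per SSE instruction, plus a scalar tail.
void add_sse(const float* a, const float* b, float* out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            // load 4 floats (unaligned-safe)
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); // 4 adds in one instruction
    }
    for (; i < n; ++i)                              // leftover elements
        out[i] = a[i] + b[i];
}

The second version is faster on a CPU without a good auto-vectorizer, but it's more code to write, debug and maintain, which is exactly the productivity trade-off I'm talking about.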

The OP was referring to DX11 and DirectCompute. Plenty of ways to take these vendor-neutral APIs and create workloads that favor one architecture over another, particularly with DirectCompute.

I'm in 100% agreement, and I don't see the situation changing anytime soon with respect to the current tools and APIs devs have to work with. I don't blame either IHV for this; it's just a hard problem to create an abstract platform that can fully leverage divergent architectures. However, I would like to point out that there is an 'ease of development' story here, just as there is with a single-threaded OoO x86 with caches vs an in-order core with a manually managed local store (e.g. Cell). There are obviously things that HW can do to make devs' and tool vendors' jobs easier. Two platforms with equal sustained performance can still impose quite different costs on devs.
 
How do the Ladybug and Mecha D3D11 demos run on GTX480?

Regardless, I think that's the nicest GPU water I've ever seen :cool:

Jawed
 
How do the Ladybug and Mecha D3D11 demos run on GTX480?
Actually, pretty well (we were planning to do diagrams on them too, but they somehow got lost). But then, they do not overemphasize areas where GF100 is way behind Cypress; they are normal technology showcases for general DX11 techniques. You could argue, of course, that tessellation is also a general DX11 technique.
 
The OP was referring to DX11 and DirectCompute. Plenty of ways to take these vendor-neutral APIs and create workloads that favor one architecture over another, particularly with DirectCompute.
Yes, this definitely can't be emphasized enough, as I noted in the Fermi thread. Performance portability is going to be less and less common moving forward. Already an increasingly large number of constants need to be tweaked for various architectures, and Fermi and Cypress have some fundamental architectural differences that seem to affect even which algorithms you should use. This is not unlike the CPU space, but it's a bit more extreme: performance cliffs are orders of magnitude rather than single-digit percentages.

This is neither good nor bad... it's just something people need to be aware of now that we're writing fairly low level code compared to the traditional graphics pipeline. Thus it's going to be harder to summarize things about which architecture is "better" in broad categories... the answer is almost always going to be "it depends" now.
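
To give a flavour of what I mean by tweaked constants, here's a purely hypothetical C++ sketch of the sort of per-architecture table that tends to creep in; the PCI vendor IDs are real, but the group and tile sizes are illustrative guesses, not measured values.

// Hypothetical per-architecture tuning table; numbers are illustrative guesses only.
struct ComputeTuning {
    unsigned threadGroupSize;  // typically baked into the compute shader via a compile-time define
    unsigned tileSize;         // how much data each group stages in shared/local memory
};

ComputeTuning pickTuning(unsigned pciVendorId) {
    switch (pciVendorId) {
        case 0x10DE: return {256, 32};  // NVIDIA (Fermi-era guess)
        case 0x1002: return { 64, 16};  // AMD/ATI (Cypress-era guess)
        default:     return {128, 16};  // conservative fallback
    }
}

And that's the benign case; the harder one is when the two architectures want different algorithms entirely, not just different constants.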
 
My guess is that this particular test has nothing to do with L1 and L2 or even the DirectCompute part of the workload, as wave physics are very straightforward and don't need complex data structures using weird access patterns.
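
For what it's worth, this is roughly the shape of update I have in mind (a made-up C++ sketch with made-up constants, not code from the demo): a plain finite-difference step whose memory accesses are all regular, neighbour-only reads.

#include <vector>

// One finite-difference step of a 2D wave equation over a height field.
// prev/cur are the two previous frames; next receives the result.
// c2dt2 stands in for (wave speed * time step / grid spacing)^2 -- an arbitrary value here.
void wave_step(const std::vector<float>& prev, const std::vector<float>& cur,
               std::vector<float>& next, int w, int h, float c2dt2 = 0.25f)
{
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            const int i = y * w + x;
            // discrete Laplacian: the four direct neighbours minus 4x the centre sample
            const float lap = cur[i - 1] + cur[i + 1] + cur[i - w] + cur[i + w] - 4.0f * cur[i];
            // leapfrog update: next = 2*cur - prev + c^2*dt^2 * laplacian
            next[i] = 2.0f * cur[i] - prev[i] + c2dt2 * lap;
        }
    }
}

Nothing in there stresses caches or atomics; it's the kind of kernel any of these architectures should chew through easily.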

NVidia just cranked up the tessellation to obscene levels. You can see in the screenshot that there are 12M triangles created. Fermi can do 2-3 tessellated triangles per clock, Cypress can do 1/3rd. That's a factor of 5 at least, and sometimes more.
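
As a quick sanity check on that factor, assuming the reference clocks of roughly 700 MHz on the GTX 480 and 850 MHz on the HD 5870 (and assuming tessellation runs at the core clock):

2 tris/clk × 700 MHz ≈ 1400 Mtris/s for Fermi at the low end
1/3 tri/clk × 850 MHz ≈ 283 Mtris/s for Cypress
1400 / 283 ≈ 5×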
 
NVidia just cranked up the tessellation to obscene levels.
Definitely possible. You can see in the NVIDIA videos that, while the quality changes very little after the first few steps, the application defaults to a fairly high tessellation level. I'm debating whether the frame rate in the YouTube videos is correct, as it appears pegged at 25 fps on the 480 regardless of the tessellation level, which seems odd assuming a fairly simple wave physics step.

Has anyone played with how this scales with tessellation levels across ATI and NVIDIA? I'd love to see a graph.
 
I may have spoken a bit too soon; it would appear the demo isn't publicly available yet? In which case we'll have to postpone any investigation until it becomes public, since I don't have access to nV's top-secret reviewer sauce :???:
 
http://www.ixbt.com/video3/gf100-2-part2.shtml

Near the bottom, it shows that "LOD 1" is faster on the GTX480 than on the HD5870. Don't know if that is "tessellation off".
Sweet, I love their reviews. Given that the GTX480's advantage is under 5% at LOD1, that confirms my beliefs. ATI's large perf deficit has nothing to do with the compute shader and everything to do with tessellation.

If we assume that ixbt's setting of LOD=50 gives about the same triangle count as bson's settings (since both give 9.4 fps on Cypress), then we see that those 12M triangles take 84.5ms extra over the LOD=1 case, which works out to exactly 6 clocks per triangle. Considering Damien's figure of 3 clocks per tessellated tri and his information about degrading performance when multiple input vertices are used on non-Fermi architectures, this makes sense.
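
Spelling that arithmetic out (using the 5870's 850 MHz core clock):

84.5 ms × 850 MHz ≈ 71.8M clocks
71.8M clocks / 12M triangles ≈ 6 clocks per triangle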

I still don't understand why ATI designed it to be so slow. The shader processors are very fast and have extremely high bandwidth to the L1 and even L2. Are you sure that Cypress doesn't use a separate vertex cache anymore? A read-port limitation on the cache holding the control points (e.g. 2 vec4's per clock) is the only reason I can think of.
 
AMD claimed the vertex cache was omitted. I think they said it now goes through the texture caches.

Could it be serialization over that crossbar between the two sides?
Cycle 0: tesselator output
Cycle 1: send to SIMD bank 0
Cycle 2: send to SIMD bank 1

I'm probably wrong on how I imagine the data path being followed.

edit: Might be easier to just broadcast to both sides.
 
Sweet, I love their reviews. Given that the GTX480's advantage is under 5% at LOD1, that confirms my beliefs. ATI's large perf deficit has nothing to do with the compute shader and everything to do with tessellation.

If we assume that ixbt's setting of LOD=50 gives about the same triangle count as bson's settings (since both give 9.4 fps on Cypress), then we see that those 12M triangles take 84.5ms extra over the LOD=1 case, which works out to exactly 6 clocks per triangle. Considering Damien's figure of 3 clocks per tessellated tri and his information about degrading performance when multiple input vertices are used on non-Fermi architectures, this makes sense.
The text says that with LOD=100 there are 28M triangles, while at LOD=25 there are 4M tris.
The 480 at LOD=100 is as fast as the 5870 at LOD=25.
 