nVidia's Island DirectX 11 Demo runs slowly on AMD GPUs

Malo · Apr 1, 2010

That's very damn impressive, imagine what the next iteration of Fermi will be capable of once they are able to manufacture it well. I think ATi dodged a bullet with this release, but they better have something for the next family.

Jawed · Apr 1, 2010

HD5970 in this does not reach its theoretical TS/setup-rate performance margin over HD5870 (71%) until LOD 100.

So Cypress isn't TS/setup bound until the highest LOD.

The PN Triangles sample, tested further up the page, reaches HD5970's theoretical margin only on the final factor of 19.

So at lower tessellation factors HD5970's performance is being limited by something other than TS/setup. Not sure what's going on there.

The detail tessellation test further up the page is only about 54% faster on HD5970, so again falling short of being TS/setup-rate dominated.

The NVidia sample, geometry/compute with the hair shows no real variation in performance with tessellation on/off on Cypress but shows 14% more performance on GF100. No idea how tessellation is being used here.

Jawed

Mintmaster · Apr 1, 2010

chavvdarrr said:
The text says that with LOD=100 there are 28mln triangles, while at LOD=25 there are 4mln tris
480@100 is as fast as 5870@25

I don't really trust that. There's a big difference between 9.4fps w/ 12M triangles (bson screenshot) and 9.4fps w/ 4M triangles. The latter would imply 18 clocks per tri, which is too large of a hit.

Jawed said:
HD5970 in this does not reach its theoretical TS/setup-rate performance margin over HD5870 (71%) until LOD 100.

So Cypress isn't TS/setup bound until the highest LOD.

No, that's when it's >95% bound by TS. At lower levels it's just not quite as bound, but still primarily so. It has to do the compute shader first, and can't get to the few large triangles while it's stuck on the small ones, so everything is serial.

Also, you're forgetting that the 5970 has to copy both the framebuffer and the simulation texture over PCIe, and this is a bigger portion of the render time at lower tessellation levels.

The PN Triangles sample, tested further up the page, reaches HD5970's theoretical margin only on the final factor of 19.

So at lower tessellation factors HD5970's performance is being limited by something other than TS/setup. Not sure what's going on there.

You're assuming that there's no crossfire overhead. I don't know why you're comparing to the 5970 anyway. Even if crossfire overhead was zero, it renders 71% faster, too, so all workloads would be 71% faster.

Subtract 165 microseconds from the render time of the PN example. You'll get a roughly 70% advantage at all settings.
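
The fixed-overhead argument can be illustrated with a minimal sketch. It assumes the HD5970 pays a constant 165 microseconds per frame (the figure from the post) on top of GPU work that runs 1.71x faster than on the HD5870. The frame times fed in are hypothetical, not ixbt's measurements, and the function names are invented for illustration.

```python
OVERHEAD_MS = 0.165   # fixed per-frame cost from the post (165 microseconds)
TRUE_SPEEDUP = 1.71   # theoretical HD5970 margin over HD5870 quoted in the thread

def measured_speedup(single_ms, overhead_ms=OVERHEAD_MS, true_speedup=TRUE_SPEEDUP):
    """Speedup a benchmark would report if the dual-GPU card has a fixed overhead."""
    dual_ms = single_ms / true_speedup + overhead_ms
    return single_ms / dual_ms

def corrected_speedup(single_ms, dual_ms, overhead_ms=OVERHEAD_MS):
    """Speedup after subtracting the fixed overhead from the dual-GPU frame time."""
    return single_ms / (dual_ms - overhead_ms)

# Hypothetical single-GPU frame times in milliseconds (not measured data).
for single_ms in (0.5, 2.0, 20.0):
    dual_ms = single_ms / TRUE_SPEEDUP + OVERHEAD_MS
    print(f"{single_ms:5.1f} ms: measured {measured_speedup(single_ms):.2f}x, "
          f"corrected {corrected_speedup(single_ms, dual_ms):.2f}x")
```

Under these assumptions the measured margin only approaches 71% when the frame time is long, while the corrected margin is the same at every setting.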

The detail tessellation test further up the page is only about 54% faster on HD5970, so again falling short of being TS/setup-rate dominated.

Same flawed logic.

The NVidia sample, geometry/compute with the hair shows no real variation in performance with tessellation on/off on Cypress but shows 14% more performance on GF100. No idea how tessellation is being used here.

Because GF100 increases in performance, the primitive count must be going down when tessellation is enabled. Cypress, however, processes tessellated prims more slowly, so it takes a performance hit even with the reduced prim count.

Looks like most of the rendering time is spent on the physics calculation, however.

CarstenS · Apr 4, 2010

From the Nvidia Island Demo: you can actually reach the 1/3-per-clock ratio for Cypress a bit below the maximum tessellation factor. Maximum for me was 274M prims/sec. On those settings I can go down to a tessellation factor of 43 (avg. exp. ratio 3,613 vs. 7,756 at factor 63) and still get 270M prims/sec on my 5870.

Settings were: Fullscreen, 640x480, all checkboxes unticked except for Query Pipeline Statistics.
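
As a quick check of the 1/3-per-clock figure, here is a minimal sketch that converts the reported primitive rates into primitives per clock. The 850 MHz value is an assumption (the stock HD 5870 core clock); the post does not say what clock the card was running at.

```python
CYPRESS_CLOCK_HZ = 850e6  # assumption: stock HD 5870 core clock

def prims_per_clock(prims_per_sec, clock_hz=CYPRESS_CLOCK_HZ):
    """Convert a primitive rate in prims/sec into prims per core clock."""
    return prims_per_sec / clock_hz

# Rates reported in the post above.
for rate in (274e6, 270e6):
    print(f"{rate / 1e6:.0f}M prims/s -> {prims_per_clock(rate):.3f} prims/clock")
```

Both rates land at roughly 0.32 prims per clock, i.e. close to one primitive every three clocks.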

Anteru · Apr 4, 2010

CarstenS said:
Maximum for me was 274M prims/sec.

Just for the record, on an optimized terrain renderer, the HD5870 reaches 810-840M triangles/second -- this is plain static geometry, no tessellation going on. Which is definitely impressive, as for instance a Quadro FX5800 doesn't go above 350M triangles/second on the same scene. So the setup/rasterizing efficiency for "plain" triangles is really 1 triangle/clock on ATI for real-world scenes, which makes me wonder why the tessellated case is 6x slower -- I would have expected equal triangle throughput for both.

CarstenS · Apr 4, 2010

Yeah, I've also seen triangle rates in the range of 770M on the rather oldish Xvox-Demo (which, btw, seems to scale exceedingly well), and that's of course without tessellation (it's DX8!).
edit: Even with AMD's own terrain tessellation (DX9) from the DirectX SDK, the maximum I've seen is ~580ish M triangles.

Why tessellation on current AMD hardware is slower than expected has been discussed here already, but no one could come up with decisive evidence for any of the proposed theories.

Silent_Buddha · Apr 4, 2010

I wonder if that's something they are working on fixing with the rumored "southern islands" chips. After all, rumors paint it as not being much faster in current applications. Improved tessellation performance would still fit into those rumors as it wouldn't have much impact on most current applications.

Then again, it's probably too soon for a product to have that kind of improvement if it was assumed the tessellation engine was just fine during development. Or perhaps they used whatever was learned during Evergreen development to improve Southern Islands. Or maybe any improvements won't show up until NI.

Regards,
SB

Mintmaster · Apr 4, 2010

chavvdarrr said:
The text says that with LOD=100 there are 28mln triangles, while at LOD=25 there are 4mln tris
480@100 is as fast as 5870@25

Sorry, I misread your post earlier. The 9.4 fps from ixbt was for LOD=50, and your 4m tri number was for LOD=25, so there's no conflict with bson.

Doing the math for LOD=25 and LOD=100, we get ~6.5 clocks per tri for both.

CarstenS said:
From the Nvidia Island Demo: you can actually reach the 1/3-per-clock ratio for Cypress a bit below the maximum tessellation factor. Maximum for me was 274M prims/sec. On those settings I can go down to a tessellation factor of 43 (avg. exp. ratio 3,613 vs. 7,756 at factor 63) and still get 270M prims/sec on my 5870.

FYI, the reason your prims/sec drops is the compute shader starts to take a larger portion of frame time when the FPS goes up at lower tessellation factors.

Settings were: Fullscreen, 640x480, all checkboxes unticked except for Query Pipeline Statistics.

At first I was a bit surprised by your number, because I'm calculating 1 prim every 6 clocks, but it looks like the reason is that dynamic tessellation is enabled in bson's and ixbt's benchmarks, but you have it unticked. This would also explain my Unigine calculations from B3D's numbers and also jibes with Damien's test. Unigine doesn't use dynamic tessellation and I got 3 cycles/tri, and Damien said he got 278Mtri/s max and less when you needed more data per vertex.

I downloaded the PNTriangles sample, and found that the pre-tessellated primitive count of the default model ('Tiny') was 20523. The amplification factors for 1, 5, 9, 19 are 1, 37, 121, and 541, respectively, at least with the refrast tessellation. Can anyone check the numbers that ixbt got? If they're correct, then Cypress is getting more than 0.5 tris/clk for the 5 and 9 cases. At the highest tessellation, Fermi is doing just under four tris per clock.

I hope Dave can give us a hint as to why the throughput is low with tessellation. I'm going to stick with my theory that it's some sort of throughput issue in getting data to the cache or from it to feed the domain shader.
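
The amplification factors quoted above for the PNTriangles sample (1, 37, 121, 541 at factors 1, 5, 9, 19) can be reproduced with a simple ring-counting sketch. The ring model is an assumption made here for illustration: it treats a triangle patch with a uniform odd integer factor as a central triangle plus concentric rings, and it is not taken from the D3D11 spec or the refrast source. It happens to match the four numbers in the post.

```python
def triangles_per_patch(n):
    """Triangles produced from one triangle patch at a uniform odd integer factor n.

    Assumed model: one central triangle, then rings. The ring between an inner
    triangle with `inner` segments per edge and an outer one with `inner + 2`
    contributes 3 * (inner + inner + 2) triangles.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("this sketch only covers odd integer factors")
    total = 1  # the central triangle
    for inner in range(1, n, 2):
        total += 3 * (inner + (inner + 2))
    return total

def triangles_closed_form(n):
    """Closed form of the same ring sum: (3n^2 - 1) / 2 for odd n."""
    return (3 * n * n - 1) // 2

for factor in (1, 5, 9, 19):
    print(factor, triangles_per_patch(factor), triangles_closed_form(factor))
```

Even factors and fractional partitioning are deliberately left out, since the post only gives numbers for odd integer factors.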

CarstenS · Apr 5, 2010

Don't know what settings Xbit ran the sample at, but with default, only ticking "Tessellation", I'm getting
2746 (sic!) - 1096 - 199 - 37.5 fps
on my HD 5870 with Cat 10.3 and the sample from Microsoft's DX SDK (Feb 2010).

Now, with extra-high fps counts, I've seen before that fast C2D systems can vastly outperform Nehalem-based setups - don't know why though.

Mintmaster · Apr 5, 2010

CarstenS said:
Don't know what settings Xbit ran the sample at, but with default, only ticking "Tessellation", I'm getting
2746 (sic!) - 1096 - 199 - 37.5 fps
on my HD 5870 with Cat 10.3 and the sample from Microsoft's DX SDK (Feb 2010).

So it looks like the numbers are legit. ATI's tessellator is either faster at integer tessellation or the bottleneck is somewhere else, like feeding the domain shader.
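
For reference, here is a minimal sketch of the throughput arithmetic behind that conclusion. It combines the base primitive count and amplification factors from the earlier post with the frame rates CarstenS reported. The assumptions are an 850 MHz core clock (stock HD 5870), that every tessellated triangle counts toward throughput, and that the four frame rates correspond to factors 1, 5, 9 and 19 in that order.

```python
BASE_PRIMS = 20523                               # pre-tessellation prims of 'Tiny'
AMPLIFICATION = {1: 1, 5: 37, 9: 121, 19: 541}   # per-patch amplification by factor
FPS = {1: 2746, 5: 1096, 9: 199, 19: 37.5}       # HD 5870 frame rates from the thread
CLOCK_HZ = 850e6                                 # assumption: stock HD 5870 core clock

def tris_per_second(factor):
    """Post-tessellation triangles per second at the given tessellation factor."""
    return BASE_PRIMS * AMPLIFICATION[factor] * FPS[factor]

def tris_per_clock(factor):
    """Post-tessellation triangles per core clock at the given tessellation factor."""
    return tris_per_second(factor) / CLOCK_HZ

for factor in sorted(FPS):
    print(f"factor {factor:2d}: {tris_per_second(factor) / 1e6:6.1f}M tris/s, "
          f"{tris_per_clock(factor):.2f} tris/clk")
```

With these inputs the factor 5 and 9 cases come out at roughly 0.98 and 0.58 tris/clk, consistent with the "more than 0.5 tris/clk" estimate, while factor 19 drops to about 0.49.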

Andrew Lauritzen · Apr 6, 2010

By the way, the sample (as well as the hair sample) executable is linked from this review:
http://www.hitechlegion.com/reviews...tx-480-directx-11-video-card-review?showall=1

Scroll down to "Download the Water and Hair Tessellation Demos."

Shaders are plain text HLSL. The demos run fine on ATI cards and I'm taking a quick peek at what they're doing as we speak.

One odd note: it seems like enabling caustics has a huge hit on ATI... not sure about NVIDIA. Drops the pipeline tessellation statistics quite significantly.

Otherwise, if you back off on the tessellation a bit, it runs smoothly on ATI with very little quality loss. That said, it is most certainly geometry limited, as changing the resolution basically doesn't affect the performance at all.

3dcgi · Apr 6, 2010

Mintmaster said:
Unigine doesn't use dynamic tessellation

Why do you think Unigine doesn't use dynamic tessellation?

Mintmaster · Apr 6, 2010

3dcgi said:
Why do you think Unigine doesn't use dynamic tessellation?

Sorry, I used the wrong word. It uses the same tessellation factor for all triangles in a model. There's no fractional tessellation, and no triangles with different tessellation factors on each edge.
http://unigine.com/devlog/page14/

That being said, does Unigine change the tessellation factor for an object as distance changes?

3dcgi · Apr 6, 2010

Mintmaster said:
That being said, does Unigine change the tessellation factor for an object as distance changes?

I've seen tessellation change with distance.

CarstenS · Apr 7, 2010

CarstenS said:
From the Nvidia Island Demo: you can actually reach the 1/3-per-clock ratio for Cypress a bit below the maximum tessellation factor. Maximum for me was 274M prims/sec. On those settings I can go down to a tessellation factor of 43 (avg. exp. ratio 3,613 vs. 7,756 at factor 63) and still get 270M prims/sec on my 5870.

Settings were: Fullscreen, 640x480, all checkboxes unticked except for Query Pipeline Statistics.

Sorry for self-quote, just to give some perspective: I've been running the same demo on our test-PC which is vastly superior to my home PC from above, so take this with a [strike]grain[/strike] ton of salt for the time being.

With the same settings as above, a GTX 480 achieves 1,388M prims/sec. If I move the view around a bit to lessen the shader load, it can get up to 1,633M prims/sec.

jimmyjames123 · Apr 7, 2010

Dude, clean up your inglish Carsten, what's da matta wit u?

Seriously though, good data, do you have any of this data at resolutions higher than 640x480?

dkanter · Apr 7, 2010

OpenGL guy said:
Or it could be something as simple as vectorizing your code when possible. You wouldn't use x87 when SSE was appropriate on a CPU.

Have you ever used PhysX : )

David

CarstenS · Apr 7, 2010

jimmyjames123 said:
Dude, clean up your inglish Carsten, what's da matta wit u?
Seriously though, good data, do you have any of this data at resolutions higher than 640x480?

GTX 480 at 1920x1200 achieves almost the same primitive rate: just a tad under 1,300M, and when changing views - actually looking down at the rocks, that is - it gets up to 1,600M again.

Have no 5870 at hand right now, but will try later.

What kills primitive rate in this demo, though, is enabling refraction rendering. It just about halves the prim rate.

edit:
With my HD 5870 at home, I'm basically getting the same results in 1920x1200: 270M prims a sec, and with the view down to the ground 274M. Again: that's with all checkboxes unticked, except for query pipeline statistics, so there's no adaptivity in tessellation, no frustum culling in Hull Shaders and no caustics/refraction rendering for the water. Please note that the frustum culling in the Hull Shader would increase the fps of the demo, but lower the created prims/sec.

3dcgi · Apr 8, 2010

Culling as much as possible in the HS should be beneficial for all hardware. Hopefully developers will do this.

Mintmaster · Apr 8, 2010

3dcgi said:
Culling as much as possible in the HS should be beneficial for all hardware. Hopefully developers will do this.

It should, but it's often an inexact science because the HS has to predict whether the DS will make triangles visible even when the patch is off-screen or backfacing.
