Beyond3D's GT200 GPU and Architecture Analysis

Why are the 9800GTX, GX2, and GTX280 so bunched up in Vantage's Cloth sim (here, too)? Tridam shows that GT200 keeps up with G80, so what's the hold up?

I see POM and Perlin scaling with ALUs, but Cloth (and Particles, depending on the review--though both Ars and ET used the same test system, QX9650 + X38) doesn't scale to the extent I'd expect. Is it scaling with the increase in TPCs, 8 in G80 to 10 in GT200?

BTW, Rys, how can you and Scott compare GT200 to the same butt but use two different words to describe the TPC (texture vs. thread)? :) I'm just wondering what term NV's promoting, if any. I can't use Damien as the tie-breaker, as I think he just calls them partitions.

Oh, and I guess it was a decent analysis. ;)
 
Why are the 9800GTX, GX2, and GTX280 so bunched up in Vantage's Cloth sim (here, too)? Tridam shows that GT200 keeps up with G80, so what's the hold up?
See Nvidia's cloth demo white paper for a description of how the method works.

Since destination buffers are used as sources for the next draw, a lot of time is spent flushing the pipe. The term used in the paper is "Swap position buffers".

As more cloth is added to a scene this should matter less, depending on the implementation of course; multiple pieces of cloth could also make it worse.

EDIT: this is also why this app doesn't scale with multiple GPUs.
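For anyone who hasn't read the paper, here's a rough, runnable sketch of that ping-pong ("swap position buffers") pattern. The solver is a toy placeholder, not Nvidia's actual constraint pass, and all the names are made up:

    # Toy illustration of the ping-pong ("swap position buffers") scheme.
    # Each pass reads what the previous pass wrote, so on the GPU the previous
    # draw has to complete (a pipeline flush) before the next one can start.
    def constraint_pass(src, dst):
        # Placeholder "relaxation": pull each particle toward its neighbours.
        # A real cloth solver would enforce distance constraints instead.
        n = len(src)
        for i in range(n):
            left = src[max(i - 1, 0)]
            right = src[min(i + 1, n - 1)]
            dst[i] = 0.5 * src[i] + 0.25 * (left + right)

    positions_a = [float(i) for i in range(8)]  # front buffer (read)
    positions_b = [0.0] * 8                     # back buffer (write)
    src, dst = positions_a, positions_b

    for step in range(4):                       # four dependent passes
        constraint_pass(src, dst)
        src, dst = dst, src                     # "swap position buffers"

    print(src)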
 
Triangle Setup Limited

This was mentioned in the article, but is it really that much of a limitation? 602m triangles per second is a lot of triangles, ~10m per frame at 60fps. Are we expecting games to require that level of polygon detail?

With that many triangles, isn't GT200 likely to be bound by other areas like bandwidth, texturing or shading?

I realise that 602m is only a little above Xenos's 500m and even below the 9800GTX's 650m, but won't all of these architectures be limited by other factors long before hitting that setup limit?
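To put rough numbers on that, here's what the quoted setup rates work out to per frame at 60fps (just arithmetic on the figures above):

    # Peak triangles per frame at 60 fps, from the setup rates quoted above.
    for name, tris_per_sec in [("GT200", 602e6), ("Xenos", 500e6), ("9800GTX", 650e6)]:
        print(name, round(tris_per_sec / 60 / 1e6, 1), "M tris/frame")
    # GT200 ~10.0M, Xenos ~8.3M, 9800GTX ~10.8M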
 
How much higher clockspeed in the graphics core domain would GTX 280 need, beyond 602 MHz, to reach the 1 teraflop and 1.2 teraflop milestones AMD has apparently reached with 4850 and 4870?
 
How much higher clockspeed in the graphics core domain would GTX 280 need, beyond 602 MHz, to reach the 1 teraflop and 1.2 teraflop milestones AMD has apparently reached with 4850 and 4870?

Assuming a 2.5 shader/core clock ratio, it would take ~670 MHz core to hit 1.2 teraflops.
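Working that through with GT200's 240 SPs, counting the dual-issue MAD+MUL as 3 flops per SP per shader clock (the usual way the peak is quoted), and taking the 2.5 ratio as the assumption above (the stock card actually runs 1296/602 MHz, about 2.15, for roughly 933 GFLOPS):

    # Core clock needed for a given peak, under the assumptions stated above.
    SPS = 240                      # GT200 stream processors
    FLOPS_PER_SP_PER_CLOCK = 3     # MAD (2 flops) + MUL (1 flop)
    RATIO = 2.5                    # assumed shader:core clock ratio

    def core_mhz_for(target_flops):
        return target_flops / (SPS * FLOPS_PER_SP_PER_CLOCK * RATIO) / 1e6

    print(core_mhz_for(1.0e12))    # ~556 MHz: 602 MHz already clears 1 TF at a 2.5 ratio
    print(core_mhz_for(1.2e12))    # ~667 MHz, i.e. the ~670 MHz figure above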
 
This was mentioned in the article, but is it really that much of a limitation? 602m triangles per second is a lot of triangles, ~10m per frame at 60fps. Are we expecting games to require that level of polygon detail?
It's very possible. You have cascaded shadow maps, which have to run through the entire scene's geometry multiple times (though good object culling will reduce that to 1.x times), and need to render more than what's visible, too. You sometimes have local shadow maps. You have environment maps - cube or planar - for reflections. You have Z-only passes and other multipassing depending on the specifics of the render engine.

All these things take a scene's polygon count and multiply it to a much bigger number that's fed to the GPU. Shadow map rendering is particularly setup limited. Say GT200 can do 30 GPix/s uncompressed z-only fillrate (BW limited). The pixels in a large 2048x2048 shadow map with 3x net overdraw could be rendered in 0.4ms. You can't even set up a quarter million polys in that time.

It's also impossible for the pixel:triangle ratio to stay constant even when accounting for post-transform caches/FIFOs, so in reality you probably can only do 5M total polys at 60fps to keep yourself setup-limited less than, say, 30% of the time.

Setup limitations are especially relevant at lower resolutions. GT200 may let you play Crysis at twice the resolution your old card did at the same FPS, but at the same resolution you won't get close to a 100% increase in FPS. A lot of people would like the latter.

This is also a reason that consoles can sort of keep up with faster PC hardware, because they generally render at lower resolution while having similar setup rates.
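If it helps, here are those shadow-map numbers worked through, using the same assumptions as above (30 GPix/s z-only fill, 3x net overdraw) and the 602m/s setup rate quoted earlier:

    # Fill time for a 2048x2048 z-only shadow map vs. the setup budget.
    pixels = 2048 * 2048 * 3            # 3x net overdraw
    fill_time = pixels / 30e9           # assumed BW-limited z-only fillrate
    setup_budget = fill_time * 602e6    # triangles GT200 can set up in that window

    print(fill_time * 1e3)              # ~0.42 ms to fill the map
    print(setup_budget)                 # ~250k triangles; any more and setup dominates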
 
Thanks for the insight, Mint.

Any particular reason why they don't improve triangle setup significantly? Is it expensive?
 
Any particular reason why they don't improve triangle setup significantly? Is it expensive?
I'm not too sure.

I think preserving triangle order has something to do with it. It complicates parallelism a bit, so maybe doubling the setup rate (along with everything else, like peak vertex rate from the VS, the scan converter, HiZ, etc.) would need, say, three times the transistors for that subsystem. I'm just making up these numbers, but a situation like this would still be worth it, IMO.

Another issue may be that many vertex buffers are stored in system memory, so PCI-E bandwidth is limiting setup anyway.
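Purely to illustrate the ordering point (this is not how NV actually builds it, just a way to picture it): if several setup units finish triangles out of order, the results still have to be retired in submission order, so finished work can sit in a reorder buffer waiting behind slower neighbours.

    # Toy model: parallel setup finishing out of order, retired in order.
    import heapq, random

    triangles = list(range(12))                          # submission order 0..11
    finish = {t: random.random() for t in triangles}     # out-of-order completion

    reorder_buffer = []                                  # min-heap on submission index
    next_to_retire = 0
    for t in sorted(triangles, key=finish.get):          # in order of finishing setup
        heapq.heappush(reorder_buffer, t)
        while reorder_buffer and reorder_buffer[0] == next_to_retire:
            print("retire triangle", heapq.heappop(reorder_buffer))
            next_to_retire += 1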
 