AMD: R9xx Speculation

GTX460 is faster than HD5870 in Civ 5:

http://www.techspot.com/review/320-civilization-v-performance/page9.html

Even with tessellation off:

http://www.techspot.com/review/320-civilization-v-performance/page7.html

AMD had devrel involved in making that game.

As you point out, the GTX 460 is faster even without tessellation, and the performance gap is roughly the same. I haven't seen Civ 5 running, but I doubt the geometric demands are very high without tessellation, so why do you think that's the issue?

It just seems to me that Civ 5 simply runs better on NVIDIA hardware, these things happen.


I don't see Metro 2033 performance holding up, either (not even at the highest in-game settings):

http://www.techreport.com/articles.x/19844/12

That's odd. Those numbers really don't match Damien's. Maybe it's because Damien is running the benchmark with high quality settings.

Still, it's hard to say that tessellation is the culprit here.
 
TS data is very small, though. It's just 4 bytes per vertex if you use a triangle strip, and close to half if you do caching. If you can stage just one kilobyte then you have several wavefronts of vertices buffered up.
Where do you get 4 bytes per vertex from? I'm seeing TS output in examples as float2 or float3.

Tessellation factor of 15 generates 337 triangles (slide 9):

http://developer.download.nvidia.com/presentations/2010/gdc/Tessellation_Performance.pdf

15 is what R600 can do, so it's nothing special in today's terms.

If an HS hardware thread of 16 patches (4 control points per patch for terrain tessellation using quad patches = 64 control points sharing a hardware thread) generates 337 triangles per patch, then that's ~5.4K triangles/vertices, 42KB assuming 8 bytes per vertex. Obviously, DS will drain those triangles as TS produces them, in batches of 64 vertices (that's ~84 batches).
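
As a quick sanity check of that arithmetic, here's a sketch in Python (the inputs are the figures assumed above: 16 patches per HS hardware thread, 337 triangles per patch, 8 bytes per vertex, 64-vertex DS batches):

```python
# Sanity check of the HS/TS/DS buffering arithmetic above.
patches_per_hs_thread = 16   # quad patches sharing one HS hardware thread
tris_per_patch = 337         # tessellation factor 15 (slide 9)
bytes_per_vertex = 8         # assumption from the post
ds_batch_size = 64           # vertices per DS batch

tris = patches_per_hs_thread * tris_per_patch   # 5392, i.e. ~5.4K
kb = tris * bytes_per_vertex / 1024             # ~42 KB to buffer
batches = tris / ds_batch_size                  # ~84 DS batches

print(tris, kb, batches)                        # 5392 42.125 84.25
```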

Which is more frequent: a vertex exported from TS or a vertex shaded by DS? DS needs to be <16 cycles to keep up with TS, if there's one SIMD running HS/DS.

Slide 25 is another hint about ATI: Moving work to DS instead of PS may not be a win. Why would that be? Probably because the population of DS threads in flight at any one time is too small.

So the count of DS threads in flight is open to question. The more of them, the more aggregate LDS is available to support the output from HS and TS. But DS count is locked to HS count by SIMD usage, which means that HS/DS load-balancing isn't independent - it would be like having a non-unified GPU, back to the bad old days of VS-dedicated and PS-dedicated shader pipes.

Then we're left not knowing the wall-clock duration of a DS invocation. The longer that is, the more coarse-grained the usage of LDS will be.

GS uses coarse-grained ring buffers off-die, seemingly for these reasons. The consumption of the ring buffer is not tied to specific SIMDs and it's not severely limited in capacity.

NVidia uses L2 to smooth these coarse-grained lumps of data, and it uses load-balancing across the entire GPU twixt stages to maximise overall throughput. Neither of these options seem to be available in Cypress.
 
As you point out, the GTX 460 is faster even without tessellation, and the performance gap is roughly the same. I haven't seen Civ 5 running, but I doubt the geometric demands are very high without tessellation, so why do you think that's the issue?

It just seems to me that Civ 5 simply runs better on NVIDIA hardware, these things happen.
I gave Civ 5 as a generic example of HD5870 being too slow in absolute terms, tessellation on or off.

That's odd. Those numbers really don't match Damien's. Maybe it's because Damien is running the benchmark with high quality settings.
D3D10 mode without tessellation according to the text, as far as I can tell :???:
 
I'm wondering why you think it will have to last over a year. Surely when they move to 28nm sometime in 2011 they will replace Cayman.
No sign of 28nm right now, and it's going to be much harder than 40nm. So much so that 32nm was abandoned in order to focus on 28nm. So, I'm assuming another saga at least as bad as 40nm.
 
I gave Civ 5 as a generic example of HD5870 being too slow in absolute terms, tessellation on or off.

Fair enough, but if geometry isn't the issue, then improving that won't change anything.

D3D10 mode without tessellation according to the text, as far as I can tell :???:

I think that's a typo because it also says that tessellation is enabled. Plus it says DX11 on the graph.
 
Tessellation factor of 15 generates 337 triangles


These may be useful for future reference (and I probably should've written them out in the article somewhere...sigh):
  • for a quad domain, the number of triangles generated = 2 * tessFactor * tessFactor
  • for a triangular domain, the number of triangles =
    • 1 + 6 * sum(2 * i), i = 1 to floor(tessFactor / 2), if tessFactor is odd
    • 6 * sum(2 * i - 1), i = 1 to (tessFactor / 2), if tessFactor is even
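
A direct transcription of those formulas in Python, as a sketch (with tessFactor / 2 read as integer division, which is what makes factor 15 land on the 337 triangles quoted above):

```python
def quad_domain_tris(tf):
    # quad domain: 2 * tessFactor^2 triangles
    return 2 * tf * tf

def tri_domain_tris(tf):
    # triangular domain; tf // 2 is the integer division assumed above
    k = tf // 2
    if tf % 2:  # odd tessFactor
        return 1 + 6 * sum(2 * i for i in range(1, k + 1))
    else:       # even tessFactor
        return 6 * sum(2 * i - 1 for i in range(1, k + 1))

print(tri_domain_tris(15))   # 337, matching the GDC slide
print(quad_domain_tris(15))  # 450
```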
 
Strange, in FEAR no AA is 229% faster than MLAA.
In fact MLAA performance is the same as 4X SSAA! But this is probably just an exception rather than the rule.

In all situations, though, the MLAA hit was larger than that of 4X MSAA.
Maybe it depends on framerate. If the framerate is 300 FPS, the relative performance hit is large, but for framerates around 50-60 FPS it is significantly smaller (just like shader-based resolve on the HD2900XT).
 
Maybe it depends on framerate. If the framerate is 300 FPS, the relative performance hit is large, but for framerates around 50-60 FPS it is significantly smaller (just like shader-based resolve on the HD2900XT).

The MLAA pass should take approximately the same time for any scene of a given resolution, so it would be expected that the hit is larger when framerate is high.
 
The MLAA pass should take approximately the same time for any scene of a given resolution, so it would be expected that the hit is larger when framerate is high.

In absolute terms it requires more time, yes. But in relative terms MLAA is a constant (ignoring setup overhead and spilling to memory in a situation where the framebuffer's size wreaks havoc inside the chip). If it's 1% at 50x50, it's going to be 1% at 2000x2000, and it's going to be 1% of 25 frames' time.


If I saw a 50% loss using MLAA I would turn on the scepticism part of my brain and declare the tests bogus. How can somebody think MLAA is as slow as MSAA? There is just no way, not even in corner cases. One could calculate how many shader FLOPS you'd have to waste for the ms they lose in the stats; that's something like an entire X1950 you would "require" just for MLAA. :rolleyes:

It's more likely they didn't understand the Catalyst settings and used MLAA+MSAA all the time, possibly also with AAA on. If that's the case, the MLAA time is the delta between one of the MSAA numbers and the MLAA number.
 
In absolute terms it requires more time, yes. But in relative terms MLAA is a constant (ignoring setup overhead and spilling to memory in a situation where the framebuffer's size wreaks havoc inside the chip). If it's 1% at 50x50, it's going to be 1% at 2000x2000, and it's going to be 1% of 25 frames' time.

Maybe, but if it takes 1% on a scene with X tris it will probably take less than 1% on a scene with 2X tris, all else being equal. (Because it doesn't care about anything but the rendered image.)

My point was that if MLAA takes say 1 ms per frame in a given situation*, that's going to be a larger drop from 300 FPS to 230 than from 60 to 56.

*Meaning mainly hardware and resolution.
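
The arithmetic behind that example, as a small sketch (the 1 ms pass cost is the hypothetical figure from the post):

```python
def fps_with_fixed_pass(fps, pass_ms=1.0):
    # framerate after adding a fixed-cost post-process pass
    return 1000.0 / (1000.0 / fps + pass_ms)

print(round(fps_with_fixed_pass(300), 1))  # 230.8 -> a ~23% hit
print(round(fps_with_fixed_pass(60), 1))   # 56.6  -> a ~6% hit
```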
 
The MLAA pass should take approximately the same time for any scene of a given resolution, so it would be expected that the hit is larger when framerate is high.
Yup, and all games in that review show an MLAA processing time of 6.5-6.9 ms at 1920x1200.

Seems like it's a good solution for deferred rendering engines, but otherwise MSAA's better quality and lack of artifacts make it the better choice.
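
For reference, taking the midpoint of that range gives a per-pixel cost of roughly 3 ns (a back-of-the-envelope sketch; the midpoint and the linear-scaling assumption are mine):

```python
pixels = 1920 * 1200          # 2,304,000 pixels
pass_ms = 6.7                 # midpoint of the quoted 6.5-6.9 ms
print(pass_ms * 1e6 / pixels) # ~2.91 ns per pixel

# Assuming cost scales linearly with pixel count:
print(pass_ms * (2560 * 1600) / pixels)  # ~11.9 ms at 2560x1600
```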

In absolute terms it requires more time, yes. But in relative terms MLAA is a constant (ignoring setup overhead and spilling to memory in a situation where the framebuffer's size wreaks havoc inside the chip). If it's 1% at 50x50, it's going to be 1% at 2000x2000, and it's going to be 1% of 25 frames' time.
No, that's wrong. Ever notice how different games have different framerates at the same resolution? Or that in virtually all games the framerate doesn't scale down at the same rate that pixel count goes up?

MLAA processing time per pixel will be roughly constant. Relative framerate impact can be all over the map.
 
A significant factor in tessellation is the base triangle count of the model you tessellate. The same object with 10 times more triangles will end up with 10 times more tessellated triangles. At factor 15, that's a lot of additional triangles.
Quite true.

In HAWX 2's tessellation, the original patches vary in size depending on the amount of target detail, because a factor of 64 isn't always enough.
 
MLAA processing time per pixel will be roughly constant. Relative framerate impact can be all over the map.

I think we ran into a little semantic mix-up here. :) We do mean the same thing.

tMLAA-per-unit -> algorithm constant
units-per-frame -> setting constant (every pixel)
frames-per-second -> variable

tMLAA-abs = tMLAA-per-unit * units-per-frame * frames-per-second

Variation in the ratio tMLAA-abs / tMLAA-per-unit comes from variations in other parts of the pipeline, not from MLAA. Relative to a constant number of pixels, MSAA shows variable times while MLAA shows constant times. MLAA itself (in isolation) is pretty much content-agnostic in practice.

This does not hold for MSAA:

tMSAA-per-unit -> algorithm constant
units-per-frame -> variable (polygon-edge pixels)
frames-per-second -> variable

tMSAA-abs = tMSAA-per-unit * units-per-frame * frames-per-second

Relative to a constant number of edge pixels, MSAA shows constant times; MLAA shows constant times, as always. MSAA itself (isolated) is pretty much content-dependent.

Sorry for the confusion; maybe I'm not finding the right words.
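
For what it's worth, here's the same model written as code (a sketch; the numbers are placeholders, not measurements):

```python
def pass_time_per_second(t_per_unit_ns, units_per_frame, fps):
    # seconds of GPU time the pass consumes per second of wall clock
    return t_per_unit_ns * 1e-9 * units_per_frame * fps

pixels = 1920 * 1200
# MLAA: units_per_frame is every pixel -- fixed by the resolution setting.
print(pass_time_per_second(2.9, pixels, 60))
# MSAA: units_per_frame is the covered edge pixels -- varies with content.
print(pass_time_per_second(2.9, pixels // 20, 60))
```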
 
BTW, if a game is e.g. texturing-intensive and the ALUs aren't fully utilized, can those resources be used for MLAA or not?
 
BTW, if a game is e.g. texturing-intensive and the ALUs aren't fully utilized, can those resources be used for MLAA or not?
No, but if a game spends a considerable amount of time setup-limited, and ATI decides to delay the MLAA until the next frame has already started rendering (i.e. an extra frame of latency), then it's possible to make the best use of resources that way.
 
As you point out, the GTX 460 is faster even without tessellation, and the performance gap is roughly the same. I haven't seen Civ 5 running, but I doubt the geometric demands are very high without tessellation, so why do you think that's the issue?

It just seems to me that Civ 5 simply runs better on NVIDIA hardware, these things happen.

Current speculation is that the compute shader for streaming textures runs much faster on Nvidia hardware. I believe it was Anandtech that first brought this up.

Regards,
SB
 
MLAA is a post-processing technique that detects edges (by finding different patterns like |, L, T, Z, stairsteps, etc.) and then does the traditional blending that AA does. If you have a scene where a lot of edges are found, you'll get a bigger performance hit than in a scene where few patterns are detected.

Running MSAA as well (in game, or via CCC) is counter-productive for performance and may not improve IQ to any appreciable degree. If a game lacks AF, then applying it via the CP might help improve pattern detection rates.
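
For illustration, here's a toy version of that detect-then-blend idea (a Python/NumPy sketch; it only does a naive luma-discontinuity pass with a flat 50/50 blend, nothing like the pattern classification and weighting a real MLAA filter does, and it is not AMD's driver implementation):

```python
import numpy as np

def toy_mlaa(img, threshold=0.1):
    """img: float32 array of shape (H, W, 3), values in [0, 1]."""
    # Luma for discontinuity detection.
    luma = img @ np.array([0.299, 0.587, 0.114], dtype=img.dtype)
    # Discontinuities between vertically / horizontally adjacent pixels.
    h_edge = np.abs(np.diff(luma, axis=0)) > threshold   # (H-1, W)
    v_edge = np.abs(np.diff(luma, axis=1)) > threshold   # (H, W-1)
    out = img.copy()
    # Flat 50/50 blend across each discontinuity; real MLAA derives
    # per-pixel weights from the recognised edge shape (L, Z, U, ...).
    ys, xs = np.nonzero(h_edge)
    out[ys, xs] = out[ys + 1, xs] = (img[ys, xs] + img[ys + 1, xs]) / 2
    ys, xs = np.nonzero(v_edge)
    out[ys, xs] = out[ys, xs + 1] = (img[ys, xs] + img[ys, xs + 1]) / 2
    return out
```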
 