AMD: R9xx Speculation

People need to stop making these simplistic calculations. Triangles come in clumps. When you have a bunch of small or hidden triangles, there are very few pixels drawn to the screen during that time. You basically have to deal with those triangles and draw the majority of the scene in the remaining time.

Here's a more detailed post I did on the matter:
http://forum.beyond3d.com/showpost.php?p=1383571&postcount=481
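To make the budgeting point concrete, here's a rough back-of-the-envelope sketch in Python - every number below is invented for illustration, not measured from any card:

[code]
clock_hz       = 850e6    # hypothetical core clock (assumption for illustration)
setup_rate     = 1        # triangles per clock through setup
frame_budget_s = 1 / 60   # 60 fps target

clumped_tris = 4e6        # tiny/occluded triangles that draw almost no pixels
clump_time_s = clumped_tris / (setup_rate * clock_hz)

remaining_s = frame_budget_s - clump_time_s
print(f"clump setup time: {clump_time_s * 1e3:.2f} ms")
print(f"time left for the pixel-heavy rest of the scene: {remaining_s * 1e3:.2f} ms "
      f"({remaining_s / frame_budget_s:.0%} of the frame)")
[/code]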

What I don't understand is why ATI can't do one tessellated triangle per clock. Cypress was taking 3-6 clocks per triangle in tessellation tests, and it looks like this generation is barely any better.

What on earth is going on inside that tessellation engine? It's inexcusable to be drawing equivalent, bandwidth-wasting, pretessellated DX10 geometry faster than tessellated DX11 geometry.

EDIT: Maybe ATI was selling itself short with that slide if this benchmark is correct...
http://www.beyond3d.com/content/reviews/55/10
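For reference, "clocks per tessellated triangle" converts to throughput trivially; assuming a Cypress-class core clock of 850 MHz purely for the arithmetic:

[code]
clock_hz = 850e6   # assumed Cypress-class core clock

for clocks_per_tri in (1, 3, 6):
    mtris = clock_hz / clocks_per_tri / 1e6
    print(f"{clocks_per_tri} clock(s) per triangle -> {mtris:.0f} MTris/s")
[/code]

By that arithmetic, the ~600 MTris/s figure quoted further down works out to roughly 1.4 clocks per triangle.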

There's a currently running meme about Cypress taking 3 clocks per tessellated triangle - this is incorrect in an absolute sense, although we can generate that scenario quite easily, as we can do a bit better than what you're seeing (note we've reached up to ~600 MTris/s by using triangular patches, thus trimming down the per-control-point data, all else being equal).

ATI's problem is primarily one of data flow: they try to keep some data in shared memory (as far as we can see, they try to keep HS-DS pairs resident on the same SIMD, with hull shaders being significantly more expensive than domain shaders), but data to and from the tessellator needs to go through the GDS. There's also the need to serialise access to the tessellator, since it's a unique resource, coupled with a final aspect we'll deal with when looking at math throughput.

Given all this, fatter control points (our control points are as skinny as possible) or heavy math in the HS (there's no explicit math in ours, but there's some implicit tessellation-factor massaging and addressing math) hurt Cypress comparatively more than they hurt Slimer - and now you know how the 3-clocks-per-triangle scenarios come into being: a combination of the two aforementioned factors.
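To illustrate how those two factors combine into a per-triangle figure, here's a toy cost model in Python - the byte counts and per-stage costs are pure guesses of mine, not AMD's numbers; only the shape of the result matters:

[code]
def clocks_per_triangle(cp_count, bytes_per_cp, hs_alu_clocks, tris_per_patch,
                        gds_bytes_per_clock=64, tess_clocks_per_tri=1.0,
                        ds_clocks_per_tri=0.5):
    # control-point data has to make a round trip through the GDS
    gds_clocks = cp_count * bytes_per_cp / gds_bytes_per_clock
    # hull shader + GDS traffic are paid once per patch; the (serialised)
    # tessellator and the domain shader are paid per generated triangle
    per_patch = hs_alu_clocks + gds_clocks \
              + tris_per_patch * (tess_clocks_per_tri + ds_clocks_per_tri)
    return per_patch / tris_per_patch

# skinny control points, almost no HS math (like the synthetic case above)
print(clocks_per_triangle(cp_count=3, bytes_per_cp=16, hs_alu_clocks=4,
                          tris_per_patch=32))     # ~1.6 clocks/tri
# fat control points and a heavy hull shader, same tessellation factor
print(clocks_per_triangle(cp_count=16, bytes_per_cp=64, hs_alu_clocks=64,
                          tris_per_patch=32))     # ~4.0 clocks/tri
[/code]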
 
If the problem is the data flow of the hardware implementation, then wouldn't it be faster to actually run the whole thing as decent software tessellation? (Especially for games where you use it for just a single thing, like landscape simulation.)
The NVIDIA instanced tessellation demo (http://developer.download.nvidia.com/SDK/10.5/direct3d/samples.html#InstancedTessellation) runs vsynced at 60 fps (who knows what the real fps is) at any tessellation level up to the max of 32 on my 4850.
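For what it's worth, what that demo (and any do-it-yourself-in-the-shaders approach) boils down to is evaluating each patch at a fixed grid of barycentric sample points, one instance per patch. A minimal CPU-side sketch of the idea in Python (my own toy code, not the SDK sample):

[code]
def barycentric_grid(level):
    """Uniform (u, v, w) sample points over a triangle at a given tess level."""
    pts = []
    for i in range(level + 1):
        for j in range(level + 1 - i):
            u, v = i / level, j / level
            pts.append((u, v, 1.0 - u - v))
    return pts

def evaluate_patch(corners, level):
    """Linearly interpolate the three patch corners at every sample point
    (a real demo would do a curved/displaced evaluation here instead)."""
    return [tuple(u * a + v * b + w * c for a, b, c in zip(*corners))
            for (u, v, w) in barycentric_grid(level)]

verts = evaluate_patch(((0, 0, 0), (1, 0, 0), (0, 1, 0)), level=8)
print(len(verts), "vertices for tess level 8")   # 45
[/code]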
 
I really hate to bring up the topic again, but awkward as it is compared to previous generations, I can't see a more sensible naming scheme if they keep selling the 5770 and 5750 for the meantime while presumably bringing out a 28nm replacement later on.

We will presumably have this:
6990 Antilles (6970 x2?)
6970 Cayman
6950 Cayman
6870 Barts
6850 Barts
5770 Juniper (replaced by 6770?)
5750 Juniper (replaced by 6750?)

But are either of these better?
6970 Antilles
6870 Cayman
6850 Cayman
6770 Barts
6750 Barts
5770 Juniper
5750 Juniper
It would fit previous generations, but there are too many 7s.
Either the 67xx or the 57xx parts are likely not to sell well.
It's awkward and likely to cause bad tech support issues to have a 6770 with two 6-pin PCIe connectors when previous x770s have had a single one.

6990 Antilles
6870 Cayman
6850 Cayman
6770 Barts
6750 Barts
6670 Juniper
6650 Juniper
WTFBBQ, renaming an old chip!!! And what would a 28nm replacement of the 66x0 be called?
 
Actually, AMD is accusing NVidia of hobbling performance for 100% of the D3D11 gamers out there. NVidia's cards are also running slower than they have to, though it just makes less of a difference to them.
You could make a case for some NVidia cards not making 120fps for stereo 3D, but otherwise 60fps seems dandy.

Looking at the dev talk that Sontin pointed to, the tessellation really is quite excessive.
I'm not prepared to be definitive based on JPEG'd low-res slides. The "full resolution" wireframe appears later at around 57:36. I can still see triangles, despite the IQ mess.

There are already too many triangles for the mountains, and they say that they cranked down the tessellation for the audience to see the wireframe. Adaptive tessellation is used, but there's no adaptation for patches that have no chance of generating silhouette triangles (and should thus use non-occlusion parallax mapping).
I don't know if the algorithm you're suggesting is usable with the data they have or whether more is required. But these kinds of refinements are certainly what I'm contemplating when I describe the algorithm as potentially being on the naive side. Cem hints at this when saying that LOD-determination could do more.
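For concreteness, the kind of silhouette test I have in mind is roughly the following (a toy Python sketch of my own, not anyone's shipping code - whether HAWX2's patch data supports something like it is exactly the open question):

[code]
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def may_contain_silhouette(patch_normals, view_dir, margin=0.2):
    """True if facing is mixed or any normal is near-perpendicular to the view,
    i.e. the patch could actually contribute silhouette triangles."""
    facing = [dot(n, view_dir) for n in patch_normals]
    if all(f > margin for f in facing):     # safely front-facing everywhere
        return False
    if all(f < -margin for f in facing):    # safely back-facing everywhere
        return False
    return True

def tess_factor(patch_normals, view_dir, silhouette_factor=32, flat_factor=1):
    # flat_factor == 1 effectively means "skip tessellation, parallax map instead"
    return silhouette_factor if may_contain_silhouette(patch_normals, view_dir) \
           else flat_factor

# front-facing patch: no silhouette possible, no tessellation needed
print(tess_factor([(0, 0, 1), (0.1, 0, 0.99), (0, 0.1, 0.99)], view_dir=(0, 0, 1)))
# grazing patch: normals straddle the view direction, keep the detail
print(tess_factor([(0, 0, 1), (1, 0, 0.1), (0.9, 0, -0.2)], view_dir=(0, 0, 1)))
[/code]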

Still, I do want to see AMD get a higher rate of setup. Cayman will hopefully do the job if that leaked line of "scalability" holds true.
Yes, exactly my thought.
 
C is angry because he insisted to the end that Barts was the 6700 series ;)

In fact I don't think he got many things right this gen.

He seems to be losing his touch and/or contacts.

His article about GF104 was categorically wrong: he missed all the big changes and went for "GF104 is 3/4 of a GF100, with the same structure, nothing more" and "336 cores is just a software bug" :D

His article about the GeForce GTX 580 was purely based on numbers and speculation. There was not one sentence in that text that seemed to be based on actual knowledge.
 
If the problem is the data flow of the hardware implementation, then wouldn't it be faster to actually run the whole thing as decent software tessellation? (Especially for games where you use it for just a single thing, like landscape simulation.)
The NVIDIA instanced tessellation demo (http://developer.download.nvidia.com/SDK/10.5/direct3d/samples.html#InstancedTessellation) runs vsynced at 60 fps (who knows what the real fps is) at any tessellation level up to the max of 32 on my 4850.
If you remember the HD2900XT's launch and all the demos and presentations, there was a demo of tessellated hills which were very similar to the hills in HAWX2. This demo didn't even use adaptive tessellation, the wireframe was quite dense, and it ran at around 80 FPS on the HD2900XT. I'm not sure, but I think these tessellated hills (probably optimized with adaptive tessellation?) were used in the Ruby demo, and very similar ones were used in the Froblins demo. All these demos ran well even on 3-4 year old hardware. I see no reason why HAWX2 (with an almost identical-looking landscape) should be so much slower. Hardware tessellation is faster as long as the code doesn't contain viral optimisations.
 
C is angry because he insisted to the end that Barts was the 6700 series ;)

In fact I don't think he got many things right this gen.

To be fair, he said that a while ago and never came back to it, but he wasn't "insisting on it to the end".
 
If you remember the HD2900XT's launch and all the demos and presentations, there was a demo of tessellated hills which were very similar to the hills in HAWX2. This demo didn't even use adaptive tessellation, the wireframe was quite dense, and it ran at around 80 FPS on the HD2900XT. I'm not sure, but I think these tessellated hills (probably optimized with adaptive tessellation?) were used in the Ruby demo, and very similar ones were used in the Froblins demo. All these demos ran well even on 3-4 year old hardware. I see no reason why HAWX2 (with an almost identical-looking landscape) should be so much slower. Hardware tessellation is faster as long as the code doesn't contain viral optimisations.
The tessellation programming model for the HD2900XT is different from DX11's, with a different data flow and thus different bottlenecks. Also, the source patches in the old demo would have been smaller than in HAWX2, because the HD2900XT supported up to 16 levels of tessellation vs. 64 for DX11.
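The patch-size point is easy to put rough numbers on; in the triangle domain a patch at factor N yields on the order of N² triangles, so the jump in the maximum factor from 16 to 64 changes how coarse the input patches can be by roughly 16x (rough arithmetic, ignoring partial factors and the quad domain):

[code]
# triangle-domain approximation: a patch at tessellation factor N yields
# on the order of N*N triangles (ignoring partial/edge factors)
for max_factor in (16, 64):
    print(f"max factor {max_factor:2d}: ~{max_factor ** 2} triangles per patch")

# so matching the same on-screen density with a max factor of 16 needs roughly
# 16x as many (and 16x smaller) input patches as with the DX11 limit of 64
[/code]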
 
He seems to be losing his touch and/or contacts.
I guess Charlie's contacts were under greater pressure not to leak anything, because let's face it - nobody in the media knew what was going on; hell, there was confusion over NI/SI within AMD itself :LOL: The amount of secrecy AMD pulled off was like nothing I've seen before.

So, is the assumption that Cayman will be a marked improvement in tessellation?
According to AMD's slide there will be an improvement; how much, we have no idea. Still, scalability and off-chip buffering sound interesting.

http://extrahardware.cnews.cz/files...ijen/barts_radeon_hd_6800_arch/tess_small.png
 
brain_stew has a slipstreamed driver here, and in the same post there is a registry setting you can check if it doesn't work:
http://forum.beyond3d.com/showpost.php?p=1485754&postcount=54

You find it under AA in the 3D section of CCC:
[screenshot: the MLAA option under AA in the 3D section of CCC (mfaa.jpg)]


@Dave Baumann: Any chance of releasing the capture tool, so we can take full-screen shots with MLAA applied? :)

Got it thanks :)
 
No problem. :) Should you run into problems with the hack, keep trying. It's really worth having the MLAA option, in my opinion. Check the other thread for screenshots. :D

I'm running it and liking it; the only downside I can think of is that it AA's text:

[screenshots showing MLAA blurring in-game text]


I just forced CCC to 8xMLAA and haven't looked back - it's awesome stuff, really!
 
Either devs will have to use specially prepared fonts when MLAA is on, or have an option to mark certain areas of the screen (text boxes, menus, HUDs) so MLAA isn't applied to them - possible?
 
Either devs will have to use specially prepared fonts when MLAA is on, or have an option to mark certain areas of the screen (text boxes, menus, HUDs) so MLAA isn't applied to them - possible?

MLAA is a post-process shader; like the HDR shader of old with R300, it's all or nothing.

(yes, I used an HDR shader when running something like a C64 emulator.)
 
MLAA is a post-process shader; like the HDR shader of old with R300, it's all or nothing.

(yes, I used an HDR shader when running something like a C64 emulator.)

o_O What's the point of an HDR shader when running a Commodore 64 emulator?!
 
Either devs will have to use specially prepared fonts when MLAA is on, or have an option to mark certain areas of the screen (text boxes, menus, HUDs) so MLAA isn't applied to them - possible?

Actually, devs should be applying proper AA and MLAA in their games directly before they add in their HUD elements. Unfortunately, not all game devs or publishers care about image quality.
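In pipeline terms it's just an ordering question. A minimal runnable Python sketch of what I mean (the "AA" pass here is a dumb blur stand-in, not real MLAA, and the "images" are just lists of grey values):

[code]
def apply_aa(img):
    # dumb horizontal blur as a stand-in for the real post-process AA pass
    return [[round(sum(row[max(x - 1, 0):x + 2]) / len(row[max(x - 1, 0):x + 2]), 2)
             for x in range(len(row))] for row in img]

def composite_hud(img, hud):
    # draw HUD pixels (None = transparent) over the already-AA'd 3D image
    return [[h if h is not None else c for c, h in zip(crow, hrow)]
            for crow, hrow in zip(img, hud)]

scene = [[0, 0, 1, 1],
         [0, 0, 1, 1]]                       # hard edge in the 3D scene
hud   = [[None, 1, 1, None],
         [None, None, None, None]]           # crisp text/HUD overlay

print(composite_hud(apply_aa(scene), hud))   # HUD pixels stay untouched
[/code]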
 
MLAA is a post-process shader; like the HDR shader of old with R300, it's all or nothing.

(yes, I used an HDR shader when running something like a C64 emulator.)
True, but you can define the patterns and how much blending it applies. It's a very young technique - the original algorithm was published by Intel just a year ago - so it will only improve. There should be a way to distinguish text and process it differently, or ignore it altogether, either on the developer side or with a very sophisticated algorithm in the drivers. It's possible there will be different levels of MLAA to choose from, like we have now with AA.

Since MLAA is pretty much the future and the "default" AA method for consoles, and will play a big part on PCs, all major corporations are working on it as we speak - Intel, Microsoft, Sony, AMD, you name it. And while it might have caught Nvidia off guard, they probably dabbled with it a bit earlier, and since AMD is taking the market by storm, Nvidia will most definitely work overtime to get this feature to its customers ASAP. I know some die-hard Nvidia fans who are jumping ship for the 6800 series, just to get MLAA ;) It's remarkable how fast it was adopted on the PS3 and now by AMD; NV shouldn't take long either.

If anyone is interested in the original MLAA work by Intel:
http://visual-computing.intel-research.net/publications/papers/2009/mlaa/mlaa.pdf
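To give a flavour of why the blend amounts are tunable at all: the core of the paper is "find an aliased step, work out how far the pattern runs, blend in proportion to the distance from the edge". A deliberately stripped-down 1D Python toy of that idea (the real algorithm classifies 2D L/Z/U-shaped patterns, so don't read this as the actual thing):

[code]
def mlaa_1d(row, threshold=0.5):
    """Toy 1D 'MLAA': find a step edge, measure the flat run leading into it,
    and blend across that run with weights that fall off away from the edge."""
    out = list(row)
    edges = [i for i in range(len(row) - 1) if abs(row[i + 1] - row[i]) > threshold]
    for e in edges:
        run = 1
        while e - run >= 0 and abs(row[e - run] - row[e]) < threshold:
            run += 1
        for k in range(run):
            w = (run - k) / (2.0 * run)        # 0.5 at the edge, tapering to ~0
            out[e - k] = (1 - w) * row[e - k] + w * row[e + 1]
    return out

print(mlaa_1d([0, 0, 0, 0, 1, 1]))   # the hard 0->1 step becomes a ramp
[/code]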
 