AMD: R9xx Speculation

I highly doubt much is going to be done about geometry performance for SI. I think it's far more likely that NI might do something in that regard, as it's supposed to feature more radical changes (when both are compared to Evergreen).

If you are not expecting changes in geometry processing, what other changes are you expecting from SI/NI/whichever thing comes this year?
 
Conceptually, isn't distributed setup/raster another way of doing TBR? Or at least the beginning of a migration towards TBR?
I wouldn't say so, necessarily. It's just able to lessen a serial bottleneck, which would also greatly benefit any TB(D)R.
 
I wouldn't say so, necessarily. It's just able to lessen a serial bottleneck, which would also greatly benefit any TB(D)R.

And it does that by performing spatial binning of primitives, right? So maybe not a full-blown TBR, but IMHO, Fermi is laying the foundations for TBR to come back to the desktop.
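A toy sketch of what "spatial binning of primitives" could look like (entirely my own illustration - the tile size and bounding-box binning are arbitrary choices, not how Fermi actually does it): each triangle is assigned to the screen-space tile(s) it overlaps, so each rasteriser only has to look at its own tiles.

[CODE]
# Toy sketch of spatial binning (illustrative only, not any real GPU's scheme).
TILE = 64  # tile size in pixels - an arbitrary choice

def bin_triangle(tri, bins):
    """Conservatively assign a triangle to every tile its bounding box overlaps."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
        for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
            bins.setdefault((tx, ty), []).append(tri)

bins = {}
bin_triangle([(10, 10), (30, 12), (20, 40)], bins)    # small triangle -> one tile
bin_triangle([(0, 0), (500, 20), (250, 400)], bins)   # large triangle -> many tiles
print({tile: len(tris) for tile, tris in bins.items()})
[/CODE]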
 
What I've been pondering on lately is the following:
Everyone seems to be assuming that AMD is going to rectify their geometry performance and best/match/come close to what Fermi is offering.

There are a few arguments that make me wonder if this is really a priority at AMD.

First, they seemed to be quite content with the tessellation performance when designing their top three chips with the same fixed-function hardware. And arguably it doesn't seem like a major bottleneck in currently shipping DX11 titles (yes, actual games, that is).
The setup-rasteriser architecture is explicitly large-triangle friendly, not small-triangle friendly. It needs a complete overhaul for future scaling. While it appears adequate for games - that's mostly because it's early days, I reckon. And lack of analysis.

Second, Nvidia was long rumored to be doing "soft tessellation", implying rather lackluster performance. Now, obviously I don't know whether AMD itself was misled by that as well, but since even SemiAccurate, and before that the Inquirer, kept trumpeting how abysmally GF100 would perform when faced with DX11 workloads, I'd consider the possibility at least.
I'm waiting for a decent analysis of the behaviour of Fermi architecture here.

Third, Nvidia's geometry performance wasn't really something to boast about before Fermi. IIRC, before the new architecture Nvidia's chips were capable of 0.5 drawn triangles per clock, whereas at least the higher-end Radeons could achieve a theoretical ratio of 1.0. This also doesn't really point in the direction of the Santa Clarans investing really heavily in this area.
That difference didn't really cost NVidia anything. I think that's partly because ATI rasterisation performance in games is poor these days and partly because NVidia had a Z rate advantage - though not necessarily being used very well.

More analysis of frame-rate minima is needed.

Fourth, according to Nvidia, the distributed tri-setup and raster grew the whole GF100 chip by 10 percent. Now, that's probably marketing, but I tend to believe that it wouldn't be quite as cheap as single-digit square millimeters to incorporate that feat (for scale, 10% of GF100's roughly 530 mm² die would be on the order of 50 mm²). Talking of which, they seemed to be quite proud of having succeeded at all, so it's probably no minor task you can throw into a largely defined chip.
How big is the setup-rasteriser in GT200? Also, how much of that growth was caused by improved early-Z culling, screen-space tiling, etc.? In other words, 10% isn't very meaningful :???:

The question I am asking is how likely it is that AMD is willing to invest major resources into a feature today mainly used for Unigine, Stone Giant and some SDK samples. I really cannot assess how much upcoming games are going to stress tessellation performance, but out of the few currently available DX11 games, I think Battleforge and BF: Bad Company 2 don't use tessellation at all.
Evergreen had features axed in order to launch on time. Sure, any chip gets features axed, in theory, to launch on time.

I think there are 4 key areas of change in R700->Evergreen:
  1. ALU utilisation - a real DOT3, pairs of DOT2 and various friends (instructions with PREV in ISA name), improved precision and general flexibility
  2. thread-generation/interpolation/LDS - interpolation is dependent upon LDS and thread generation used to be a sub-function of interpolation (or interpolation used to be a sub-function of thread generation - doesn't really matter)
  3. tessellation - support for HS and DS, a dedicated TS stage for D3D11 (distinct from Xenos/R600 style TS)
  4. scatter/gather/atomics - memory operation performance freed from TUs and ROPs
To put it bluntly, all of these are failures:
  1. there is no need to have kept 5-way SIMD in my view, 4-way is clearly preferable (>80% utilisation is rare - see the toy arithmetic after this list)
  2. thread generation is still bound by rasterisation rate, it seems - multiple DirectCompute kernels can execute concurrently, which helps, but thread generation for small triangles is fucked, I reckon
  3. it's not going to scale, it's designed for a slow rasteriser (see 2.)
  4. scatter/gather is still a second-class citizen
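As a toy illustration of the slot-utilisation point in 1. (the average-ILP figure below is a made-up assumption of mine, not measured data): the same amount of instruction-level parallelism fills a narrower VLIW much better.

[CODE]
# Toy arithmetic for the VLIW-width point above (3.5 ops/issue is a hypothetical number).
avg_ops_per_issue = 3.5
for width in (5, 4):
    utilisation = min(avg_ops_per_issue, width) / width
    print(f"VLIW{width}: {utilisation:.0%} slot utilisation")
[/CODE]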
Options:
  1. There are a few choices for VLIW reorganisation. Ultimately it's intricately tied to register-file and TU organisation/operation. One issue is that as the GPU architecture moves closer to being re-used in Fusion, double-precision cannot remain optional, i.e. it has to be optimal for all ALU implementations. But that could be 5 years away. How much is Fusion supposed to trail GPU? 18 months? Also, what clock rates are required for Fusion, i.e. when tighter integration of GPU-ALU processing comes, is 1GHz suitable when the CPU is running at 3GHz?
  2. and 3. go together. LDS seems fine, though the compiler guys are really struggling to make it work well (similar story for interpolation).
    Rasterisation is inherently parallelisable at the granularity of a quad. Hierarchical-Z/early-Z makes hierarchical-rasterisation preferable, but that process is the slave of one triangle at a time per quad, due to fundamental ordering-constraints. So hierarchical-Z either needs to be shallower (with smaller screen-space tiles) or a delayed-commit early-Z system needs to be implemented.
    There's an argument that if rasterisation of small triangles is done quickly (so that a hardware thread can be populated with <=16 triangles' fragments in <=4 cycles) then early-Z becomes redundant at least some of the time. This is on the basis that it's faster to rasterise, shade and late-Z cull than it is to delay a hardware thread until it's fully populated with 16 triangles by a slow rasteriser/hierarchical-Z unit (or to run a 16-quad hardware thread with only 1 quad of fragments active). Best for short shaders/Z-prepass/shadow-buffer-rendering.
    A joker in the pack is sample-frequency shading (a feature of D3D10.1), i.e. shading per MSAA sample, not per fragment - this naturally makes all triangles typically 4 times bigger :p
  3. once the hardware can cope with lots of small triangles then it'll be worth making a fast TS.
    GDS in Evergreen (an enhanced version of GDS from R700 and presumably R600) is a bottleneck for TS operations.
    Also, triangles always "exit" a core, so there's a wodge of data moving large distances. NVidia's design minimises the movement of triangle data (though doesn't eliminate it - depends on the screen-space tiles that a triangle ends-up being rasterised over), i.e. L2 is a bottleneck for some triangles. TS is trivially parallelisable per patch, though a patch can generate a vast output - cache ahoy.
    I still wonder if setup could be implemented as a GS kernel (not much different to the fixed function interpolators that are now run as a kernel). Since GS in ATI is, sometimes, dependent upon buffers in global memory, caching for global memory would be able to underly GS and keep setup's data on-die.
  4. memory controllers in Evergreen still seem to be ROP/TU centric. There appears to be no meaningful coalescing, as it appears to be available only when a scatter/gather has no incoherency whatsoever: black/white - detail is scant though. Cache ahoy?
I'm really puzzled why 1. didn't happen in Evergreen. Maybe that was purely a workload decision, rather than a cut-back due to 40nm problems?

As for the rest, they're all major changes. Then it's a matter of whether the architecture undergoes creeping-featurism or one final radical upgrade.

I tend to think it's creeping-featurism.

I suspect 2 and 3 are dependent upon 4, because data-paths/cache-hierarchy need to be made robust, something that NVidia did a good job of in Fermi. 3 doesn't seem particularly difficult (and I don't see anything wrong with making TS a kernel, for what it's worth). 4 could be done without doing 2 and 3, leaving them for later. Hell, 3 could be done even if 2 isn't (the presence of two distinct TS units screams kludge to me).

Jawed
 
The setup-rasteriser architecture is explicitly large-triangle friendly, not small-triangle friendly.
Jawed

What exactly are you basing this on? If it's Damien's numbers, perhaps you shouldn't. Cypress can hit near its theoretical rate with 1-pixel tris just fine, and it will be screwed by large triangles that straddle screen tiles, thus making raster the limiting factor. The story with tessellation is a bit more involved, and the really awful 1 tri per 3 clocks behaviour is not, in spite of what has been alluded to, necessarily the norm (albeit there seems to be a fixed cost attached to enabling tessellation, which is independent of tessellation factor/triangle size).
 
Flight simulators need to use double precision because the size of the earth is large enough that single precision is only accurate to +- 1 meter or so at one earth radius from the origin. This is fine for mountains and such, but what about simulating a country airstrip, where the runway is almost-but-not-quite level? A bump that's merely an inch high is quite noticeable when you're rolling over it at 100 MPH during takeoff...
It's very unlikely that existing FS do that, because you couldn't render with DP precision until very recently.
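For what it's worth, the precision figure in the quote is easy to check - a quick sketch (my own, using numpy) of the spacing between adjacent representable values at roughly one Earth radius from the origin:

[CODE]
# Spacing between adjacent floating-point values ~6371 km from the origin.
import numpy as np

earth_radius_m = 6_371_000.0
print(np.spacing(np.float32(earth_radius_m)))  # ~0.5 m in single precision
print(np.spacing(np.float64(earth_radius_m)))  # ~1e-9 m in double precision
[/CODE]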
 
What exactly are you basing this on?
Patents which I've linked and discussed + very strong recommendations from AMD devrel not to tessellate below 8-fragment triangles + the stunningly awful performance in non-game tests.

The instant a triangle falls below 4 quads in size and occupies only one screen space tile, one rasteriser is idle. This kills performance on z-prepass and shadow buffer rendering.

Tessellation performance?

http://www.hitechlegion.com/reviews/video-cards/4742-evga-geforce-gtx-460-768mb-video-card?start=18
http://www.hitechlegion.com/reviews/video-cards/3177?start=17

See the SubD11 sample result at the bottom of those pages. This is something that AMD demonstrated running on Juniper over 1 year ago. I bet the guys at NVidia had a chuckle when they saw the performance.

NVidia's water demo is comparatively kind :LOL:

Jawed
 
The setup-rasteriser architecture is explicitly large-triangle friendly, not small-triangle friendly. It needs a complete overhaul for future scaling. While it appears adequate for games - that's mostly because it's early days, I reckon. And lack of analysis.
Tell me, did we ever leave "the early days" of DX10's Geometry Shader? (I don't have to spell out the analogy here, do I?)
I'm waiting for a decent analysis of the behaviour of Fermi architecture here.
Apparently, people associate bad-ass slowness with "in software", and that's what I was talking about - not whether or not Fermi might use transistors for other stuff than the tessellation stage. :)

That difference didn't really cost NVidia anything. I think that's partly because rasterisation performance in games is poor these days and partly because NVidia had a Z rate advantage - though not necessarily being used very well.
Frankly, I have no idea what you're saying here. Which difference? Whose raster perf is poor? And which z-rates are poorly utilized?


How big is the setup-rasteriser in GT200? Also, how much of that growth was caused by improved early-Z culling, screen-space tiling, etc.? In other words, 10% isn't very meaningful :???:
Just multiplying the setup/rasterizer units doesn't get you anywhere if you do not also reinforce the necessary infrastructure… and to do it properly, you'll have to walk the painful way, I guess.
Anyways, I had the same question and they said: 10% more compared to an approach analogous to GT200/RV790.

I think there are 4 key areas of change in R700->Evergreen:
You forgot one very important key change: the number of units. Granted, it's a rather obvious thing, but if you have a performance, cost and yield target, you also have to factor in exactly how many of the engineers' dreams you can incorporate into the new design in order to meet those goals.
 
Most MatLab users are students who run it on laptops. How many MatLab users are free to both buy and configure their systems as they please, and of those who do (I do) how many would choose to configure it with a top-end videocard (I wouldn't)? And how do those numbers compare to the number of people buying graphics cards to play games?

It makes sense to optimize your product to fit your market - it allows higher performance/lower power draw at lower cost, benefiting customers while still allowing healthier margins and greater market flexibility.
It's a chicken-and-egg circle.
Once they deliver the performance and capabilities, there'll be programs and applications that take advantage of it.

I'm not really sure (I don't have any real knowledge in that area), but do video encoders use DP or do they use SP?

I don't know; for me, as a user who doesn't play any games, I'm eager to see some applications that would benefit from the GPU - something beyond gaming.
On a much larger scale.
 
69k assuming an average die size mix of ~250mm2 and 90% average yield.
78k assuming an average die size mix of ~250mm2 and 80% average yield.
90k assuming an average die size mix of ~250mm2 and 70% average yield.
104k assuming an average die size mix of ~250mm2 and 60% average yield.
125k assuming an average die size mix of ~250mm2 and 50% average yield.
156k assuming an average die size mix of ~250mm2 and 40% average yield.
207k assuming an average die size mix of ~250mm2 and 30% average yield.

So pick a yield and scale by what you assume the average die size is. I've already factored in a 10% wafer trim factor (basically unusable space on the wafer which is a combination of geometric issues and min spacing issues).
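The per-wafer side of that arithmetic can be sketched like so (300 mm wafers and modelling the 10% trim as simply discarding 10% of the wafer area are my assumptions, so the numbers are ballpark only; the wafer counts above also depend on a total chip target that isn't quoted here):

[CODE]
# Ballpark dies-per-wafer arithmetic (300 mm wafer and a flat 10% area trim assumed).
import math

WAFER_DIAMETER_MM = 300.0
TRIM = 0.10  # unusable wafer space (edge geometry, min spacing)

def good_dies_per_wafer(die_area_mm2, yield_fraction):
    wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
    gross_dies = wafer_area * (1 - TRIM) / die_area_mm2
    return gross_dies * yield_fraction

for y in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3):
    print(f"{y:.0%} yield -> ~{good_dies_per_wafer(250.0, y):.0f} good 250 mm² dies per wafer")
[/CODE]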

Thanks for all that information and work. Really cool. I'd personally love to have a discarded wafer with bad chips on it. I'd frame it and hang it up on my wall. They are really beautiful things. I've only seen two wafers that were actually used, though.
 
Anyways, I had the same question and they said: 10% more compared to an approach analogous to GT200/RV790.
Not sure what NVIDIA would know about the "RV790" size / approach. The simple fact of the matter is that NVIDIA had to make a larger change to their architecture to support Tessellation simply because they didn't have it before, whereas it has been ingrained in our designs for multiple generations.
 
I'm not really sure (I don't have any real knowledge in that area), but do video encoders use DP or do they use SP?

They primarily use int. They use FP but generally only as control parameters for things like quality metrics, rate control, etc. Most of the actual data stays in the fixed point domain.
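To make that concrete with a toy example (mine, not taken from any particular codec): the bread-and-butter block-matching operation in an encoder is integer arithmetic end to end.

[CODE]
# Sum of absolute differences between two 16x16 pixel blocks - pure integer math.
import numpy as np

block_a = np.random.randint(0, 256, (16, 16), dtype=np.int32)
block_b = np.random.randint(0, 256, (16, 16), dtype=np.int32)
sad = int(np.abs(block_a - block_b).sum())  # no floating point involved
print(sad)
[/CODE]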
 
Patents which I've linked and discussed + very strong recommendations from AMD devrel not to tessellate below 8-fragment triangles + the stunningly awful performance in non-game tests.

The instant a triangle falls below 4 quads in size and occupies only one screen space tile, one rasteriser is idle. This kills performance on z-prepass and shadow buffer rendering.

Tessellation performance?

http://www.hitechlegion.com/reviews/video-cards/4742-evga-geforce-gtx-460-768mb-video-card?start=18
http://www.hitechlegion.com/reviews/video-cards/3177?start=17

See the SubD11 sample result at the bottom of those pages. This is something that AMD demonstrated running on Juniper over 1 year ago. I bet the guys at NVidia had a chuckle when they saw the performance.

NVidia's water demo is comparatively kind :LOL:

Jawed

Patents are nice and all, but their materialization in hardware is an unknown quantity. nVidia tends to recommend the same thing with regard to not going under 8-pixel triangles in general - does that mean they're small-triangle unfriendly (hint: Fermi's epic setup rate happens with really small triangles, so if that were the only consideration, that'd be what they'd want always)?

Do you base your definitive statement about what happens the instant a triangle falls under a particular area on actual experience with the hardware? If so, please detail it, because that's definitely not what I (and others) are seeing in practice. Tessellation means a shitload more than setup/raster, and there are multiple potential sticking points with regards to data flow that can and apparently do hamper Cypress performance, or rather expose the parts where it has a less graceful performance decline compared to Fermi. I don't think that we should be using SDK samples, which are written for readability rather than performance and which do quite a few things, to underline specific architectural traits. If we want to talk about setup/raster, we need to use something that isolates those portions as well as possible, and start from there.
 
Not sure what NVIDIA would know about the "RV790" size / approach. The simple fact of the matter is that NVIDIA had to make a larger change to their architecture to support Tessellation simply because they didn't have it before, whereas it has been ingrained in our designs for multiple generations.
Yes, of course. And of course their 0.5-triangles-drawn-per-clock approach would have made them look REALLY bad in tessellated workloads. OTOH, their knowledge of your chips is - IMHO - second only to your own (and vice versa). Both companies should have tools for analyzing chips that normal people like us can only dream of. :)
 