AMD: R7xx Speculation

Tchock · Mar 19, 2008

ShaidarHaran said:
Yet just about every game on the market runs faster on NV hardware, so one corner case where R6xx runs on-par with G8x/9x isn't indicative of the average case at all.

R6xx is tripped up by its lack of texture filtering and Z fill more than by ALU limits in the "average cases", so even granting that, I still don't see where aaron's ALU-bound 50% comes from.

mczak · Mar 19, 2008

Dave Baumann said:
And he was specifically talking about shader bound cases, not average cases.

Yes exactly. Those average cases also tend to run at something like 80% of the performance of an 8800GT on a 9600GT, so apparently they aren't really shader bound. Now, I don't doubt there could be cases where the RV670 shader ALU would have a lower efficiency, but the point is there isn't any real-world case out there which shows this. Just saying it's only 50% as efficient by _assuming_ real-world benchmarks are shader bound, when in reality they are not, is not sufficient...
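To put a rough number on that argument: if those games were purely shader-ALU bound, a 9600GT should land near its share of the 8800GT's peak shader throughput. The sketch below assumes the commonly listed reference specs (64 SPs at 1625 MHz versus 112 SPs at 1500 MHz); those figures are not from this thread, and `peak_alu_rate` is just an illustrative helper.

```python
# Rough sanity check: how fast should a 9600GT be relative to an 8800GT
# if a game were purely shader-ALU bound?
# ASSUMPTION: SP counts and shader clocks are the commonly listed reference
# specs, not figures taken from this thread.

def peak_alu_rate(stream_processors: int, shader_clock_mhz: float) -> float:
    """Relative peak ALU throughput: SP count times shader clock."""
    return stream_processors * shader_clock_mhz

g94 = peak_alu_rate(64, 1625.0)    # 9600GT (assumed reference specs)
g92 = peak_alu_rate(112, 1500.0)   # 8800GT (assumed reference specs)

expected_if_alu_bound = g94 / g92
observed = 0.80  # "something like 80%" from the post above

print(f"expected if ALU bound: {expected_if_alu_bound:.0%}")
print(f"observed:              {observed:.0%}")
```

Under those assumptions the observed ratio sits well above what pure ALU scaling would predict, which is the point being made: something other than shader ALU throughput is setting the pace.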

Mat3 · Mar 19, 2008

aaronspink said:
I think all the marketing Nvidia needed was 20+% more performance with 80% of the resources.

What do you mean 80% of the resources? Are you talking about transistor count or something else?

Sound_Card · Mar 19, 2008

It's impossible to say. But either way, I'm convinced that Nvidia will shortly be wishing things had gone differently. Because even a company with a recent track record as poor as AMD's becomes a threat given enough time.

Specifically, that threat takes the form of AMD's upcoming RV770 high end graphics chip, very likely to be sold under the Radeon HD 4800 brand. It might just be a corker.

Based on the latest 55nm production process, the chip is expected to address most of the weaknesses of AMD's current Radeon HD 3800 family. For starters, I hear that the GPU's texture unit count has been doubled from 16 to 32 units, solving the existing HD 3800's most obvious flaw......

Best of all, it looks like it won't be long before we find out for sure. Early engineering samples of RV770 have been spotted in the wild. It could be on sale before summer is in full swing.

Techradar.com

He also wrote about 5 SIMD arrays containing 160sp total. Personally I think it's a journalist writing his opinions on the current rumor mill. However, the theory still stands. I do think it's a little ridiculous imagining an RV with this kind of power, but then again, hasn't the meaning of the RV code name changed? It no longer represents value per se, but a single-GPU solution? In my honest opinion, though, I still doubt these specs, as they seem more reasonable for R700, but I also can't help but imagine the possibility.

So the SIMD arrays would be doubled in size with an additional SIMD array added on. That means a 5:1 ALU:TEX ratio and perhaps an additional back end pumping out 20 pixels per clock as opposed to R600's 16? I would think the die size of this would be much closer (</>) to 300mm2 than the past rumored 252mm2.
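For anyone wanting to check the arithmetic behind that guess, here is a minimal sketch. It assumes RV670 is 4 SIMD arrays of 16 five-wide units with 16 texture units, that the rumoured 160 figure counts five-wide units rather than scalar lanes, and that each SIMD array feeds one 4-pixel quad per clock to the back end. The `describe` helper is purely illustrative.

```python
# Arithmetic behind the rumoured configuration (a sketch, not confirmed specs).
# ASSUMPTIONS: RV670 = 4 SIMD arrays x 16 five-wide units, 16 texture units;
# the rumoured "160" counts five-wide units; each SIMD array feeds one
# 4-pixel quad per clock to the back end.

def describe(simd_arrays: int, units_per_array: int, texture_units: int) -> dict:
    units = simd_arrays * units_per_array
    return {
        "units": units,
        "alu_tex_ratio": units / texture_units,
        "pixels_per_clock": simd_arrays * 4,
    }

rv670 = describe(4, 16, 16)
# One extra array, each array doubled in size, texture units doubled to 32.
rumoured = describe(4 + 1, 16 * 2, 32)

for name, cfg in (("RV670", rv670), ("rumoured RV770", rumoured)):
    print(name, cfg)
```

Under these assumptions the ratio moves from 4:1 to 5:1 and the back end from 16 to 20 pixels per clock, matching the figures in the post.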

Shtal · Mar 19, 2008

Sound_Card said:
Techradar.com
So the SIMD arrays would be doubled in size with an additional SIMD array added on. That means a 5:1 ALU:TEX ratio and perhaps an additional back end pumping out 20 pixels per clock as opposed to R600's 16? I would think the die size of this would be much closer (</>) to 300mm2 than the past rumored 252mm2.

Maybe RV770 will be on 45nm instead of 55nm. Nobody knows for sure, but if so, that would explain the rumors of 252mm2 with 160 SPs.

Jawed · Mar 19, 2008

mczak said:
I don't buy that. It may be a bit more efficient but I doubt it's really that much.

20-30% on a good day with a following wind, I'd say. You need predominantly serial-scalar operations with minimal transcendentals/texturing. Transcendentals and texturing both increase overall utilisation of the ALUs (MI is an expensive unit) but both increase the chances of putting bubbles into the MAD lane.

mczak said:
If you quote "shader bound cases" I'll point you at the 3DMark06 Perlin noise test (which, coincidentally, is pretty much the only benchmark which shows scaling about what you'd expect between G92 and G94, so it's probably REALLY shader ALU bound). In this benchmark, an HD3870 runs neck and neck with an 8800GTS-512, so it doesn't look that much less efficient (based on peak MAD rates the HD3870 should be just a tad faster).

Yep this is truly an ALU-limited shader, 9.31:1 ALU:TEX ratio (in the D3D assembly).

And, as I posted before, it runs at 93% scalar utilisation on R6xx (197 instruction slots - 916 scalar operations - 4.65 scalars per instruction slot) - ignoring the TEX instructions, that is.

So NVidia's design is theoretically ~8% faster here, if you ignore dependencies across the MAD and MI units.
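The utilisation figures above can be reproduced with a few lines. This is only a restatement of the numbers in the post, assuming 5 scalar lanes per instruction slot and an idealised scalar design that wastes no lanes; the variable names are illustrative.

```python
# Reproduce the Perlin noise utilisation figures quoted above.
# ASSUMPTION: 5 scalar lanes per instruction slot, and an idealised scalar
# design running at 100% lane utilisation (dependencies ignored).

instruction_slots = 197
scalar_ops = 916
lanes_per_slot = 5

scalars_per_slot = scalar_ops / instruction_slots
utilisation = scalars_per_slot / lanes_per_slot
ideal_scalar_advantage = 1 / utilisation - 1

print(f"scalars per slot:        {scalars_per_slot:.2f}")
print(f"lane utilisation:        {utilisation:.0%}")
print(f"ideal scalar advantage:  {ideal_scalar_advantage:.1%}")
```

The advantage comes out a little under 8%, which is where the "~8%" comes from.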

More fundamentally, I think NVidia's ALU design is extremely costly in a number of areas:

Register file:

G92 has 16 register files
RV670 has 4 register files (though I think each one prolly has a ghost copy in order to support operand fetch bandwidth)

When a shader uses lots of registers (say more than about 4 vec4s), NVidia's design suffers from a severe drop in "occupancy" due to the relatively small size of each register file. So performance hits a brick wall, much like register file pressure affected performance in NV40...G71. Decoupled texturing in G92 significantly lowers the costliness of low thread counts, though.
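A rough sketch of how that occupancy drop plays out. The register-file size (8192 32-bit registers) and thread cap (768) are the figures documented for CUDA compute capability 1.0/1.1 parts, used here as an assumption about how graphics shaders are scheduled; whole-warp rounding is a simplification, and `resident_threads` is an illustrative helper.

```python
# Sketch: resident threads per register file as shader register use grows.
# ASSUMPTIONS: 8192 32-bit registers per register file and a 768-thread cap
# (CUDA compute capability 1.0/1.1 figures), threads allocated in warps of 32.

REGISTERS_PER_FILE = 8192
MAX_THREADS = 768
WARP_SIZE = 32

def resident_threads(vec4_registers: int) -> int:
    """Threads that fit when each one needs this many vec4 registers."""
    scalars_per_thread = vec4_registers * 4
    by_registers = REGISTERS_PER_FILE // scalars_per_thread
    threads = min(MAX_THREADS, by_registers)
    return threads - threads % WARP_SIZE  # round down to whole warps

for vec4s in (2, 4, 8, 16):
    threads = resident_threads(vec4s)
    print(f"{vec4s:2d} vec4s -> {threads:3d} threads ({threads / MAX_THREADS:.0%})")
```

Each doubling of register use past the cap halves the number of threads available to hide latency, which is the "brick wall" effect described above.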

Thread issue:

G92 scoreboards every operand for each instruction in a shader - "is r0.z ready to issue a MAD?"
RV670 scoreboards texture-clauses ("have these 3 texture instructions produced their result yet?")

NVidia's design hugely increases the amount of per thread state data (since it all needs to be scoreboarded) which also means the instruction/thread issue logic is pretty complex. Registers have read-after-write latency that can lower the population of available threads, something that never affects RV670.

Branching:

G92 is forced to flush the ALU pipeline (~ 8 clocks?) when a thread turns out not to need to run that clause - if the clause is at least 5 or 8 instructions long (compiler makes a guess about the threshold for clause flushing)
RV670 never flushes its ALU pipeline due to branching, it always swaps the thread for this test - stalls only arise when available threads are entirely exhausted

This hurts most in code that loops an indeterminate number of times, I guess (variable loop count per object within a thread). Arguably this is a style of coding that's rare, so doesn't matter.

There are, though, advantages in NVidia's design:

Instruction-issue:

G92 issues 2 operations per processor
RV670 issues 5 operations

G92 has relatively simple ALU compilation aided by the fact that attribute-interpolation increases opportunities to maximise ALU utilisation. The requirement to schedule attribute-interpolation instructions does make compilation more complex, of course. RV670 is comparatively easy to run at very low utilisation with serially-dependent scalar instructions - but such code doesn't make up the majority of graphics shaders.

Texture (memory fetch) latency hiding:

G92 issues all texture operations independently
RV670 prefers to issue texture operations in clauses (e.g. 4 TEX operations)

This reduces the number of threads that G92 needs, per SIMD, in order to hide texturing latency, which has a knock-on effect of lowering register file consumption.

Register file usage:

G92 serial-scalar instruction issue
RV670 5-way instruction issue - but the ALU pipeline contains 5 scalar registers per object

G92 can use fewer registers in the compiled shader by re-ordering instructions (e.g. splatting vector instructions such that a single scalar register can be used as "scratch" for all channels of the vector result over the duration of the vector instructions). Though RV670's pipeline registers (effectively a mini register file of 8 clocks * 5 scalars per processor * 64 objects = 2560 scalars, 10KB) also mean it can reduce consumption of the register file for purely "scratch" registers (results that are only needed for one "clock" - actually 8 clocks of pipeline time).
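The pipeline-register figure works out as follows, assuming each scalar is a 32-bit (4-byte) value; the variable names are illustrative.

```python
# Size of RV670's in-pipeline "mini register file" from the figures above.
# ASSUMPTION: each scalar is a 32-bit (4-byte) value.

pipeline_clocks = 8
scalars_per_processor = 5
objects = 64
bytes_per_scalar = 4

pipeline_scalars = pipeline_clocks * scalars_per_processor * objects
pipeline_bytes = pipeline_scalars * bytes_per_scalar

print(f"{pipeline_scalars} scalars = {pipeline_bytes // 1024} KB")
```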

Branching granularity:

G92 has a basic thread granularity of 16 - though this doesn't apply to pixel shading (32) and will prolly be 32 in future designs
RV670 has thread granularity of 64. Future designs can either be smaller or larger. I'd tend to expect larger.

Comparatively RV670 hurts on all shader code that allows incoherent branching, with the worst effect seen in geometry or vertex shaders.
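A toy model of why larger batches hurt with incoherent branching: a batch has to execute both sides of a branch unless every object in it agrees. The sketch below assumes each object takes the branch independently with the same probability, which is a deliberately pessimistic simplification (real pixels and vertices are spatially correlated). `divergence_probability` is an illustrative helper.

```python
# Toy model: chance that a batch must execute both sides of a branch.
# ASSUMPTION: every object takes the branch independently with probability
# p_taken. Real workloads are correlated, so this overstates divergence.

def divergence_probability(p_taken: float, batch_size: int) -> float:
    all_taken = p_taken ** batch_size
    none_taken = (1.0 - p_taken) ** batch_size
    return 1.0 - all_taken - none_taken

for batch_size in (16, 32, 64):
    p = divergence_probability(0.01, batch_size)
    print(f"batch of {batch_size:2d}: {p:.1%} of batches diverge")
```

Even with only 1% of objects taking the branch, the share of divergent batches grows quickly as the granularity goes from 16 to 64.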

Ultimately NVidia saved ALU die space by running at 2x (plus) clock rates when compared with the GPU's core clock. Comparatively I think this is ATI's key drawback in terms of ALU implementation (not "serial scalar") while I think the simplicity of thread-control and the high thread:register-file ratio means ALU capability will scale very rapidly.

Jawed

Jawed · Mar 19, 2008

Sound_Card said:
So the SIMD arrays would be doubled in size with an additional SIMD array added on. That means a 5:1 ALU:TEX ratio and perhaps an additional back end pumping out 20 pixels per clock as opposed to R600's 16? I would think the die size of this would be much closer (</>) to 300mm2 than the past rumored 252mm2.

There's nothing there that we haven't already discussed and it still seems very unlikely. It would need to consist of lots of "custom" logic (in the sense of NVidia's ALU design, in order to save die space) for that kind of scaling-up.

The 5:1 ratio sounds right to me (ATI always wants to increase it), but everything else sounds wrong

In my view there's no point increasing the colour fillrate, so RBEs would stay at 16. The key question is Z rate.

Jawed

Sound_Card · Mar 19, 2008

Jawed said:
There's nothing there that we haven't already discussed and it still seems very unlikely. It would need to consist of lots of "custom" logic (in the sense of NVidia's ALU design, in order to save die space) for that kind of scaling-up.

The 5:1 ratio sounds right to me (ATI always wants to increase it), but everything else sounds wrong

In my view there's no point increasing the colour fillrate, so RBEs would stay at 16. The key question is Z rate.

Jawed

Hey, I'm not saying it's true, as I highly doubt it myself. I'm still a firm believer in the 96 SPs.

trinibwoy · Mar 19, 2008

Jawed said:
In my view there's no point increasing the colour fillrate, so RBEs would stay at 16. The key question is Z rate.

While going through the TR GX2 review I noticed quite a few occurrences of the 9600GT/8800GT being close yet significantly faster than the 3870/3850. One explanation would be that the 8800GT is so bandwidth limited that its shader and TMU advantage over the 9600GT count for nought. Another is that AA sample-rate and/or Z-fillrate of (pretty much equal across all G9x cards) is having a big impact.

Jawed · Mar 19, 2008

trinibwoy said:
While going through the TR GX2 review I noticed quite a few occurrences of the 9600GT/8800GT being close yet significantly faster than the 3870/3850. One explanation would be that the 8800GT is so bandwidth limited that its shader and TMU advantage over the 9600GT count for nought.

Seems like the most sensible conclusion to me. There are times, though, when 8800GTS-512 is ~50% faster, but I've not seen more than that.

trinibwoy said:
Another is that AA sample-rate and/or Z-fillrate of (pretty much equal across all G9x cards) is having a big impact.

I can't work out what you're saying here.

Jawed

trinibwoy · Mar 20, 2008

Yeah, I knew that phrase wouldn't go over well the way I worded it.

Just saying that the AA sample rate and Z fillrate of G92 and G94 are similar, and at the same time considerably higher than RV670's, so those are other potential reasons why they can have similar performance while AMD's parts lag behind.

(I'm pretty sure G9x has a 4x Z advantage per-clock but I'm not sure whether RV670 is capable of 4 or 2 AA samples per-clock.)

Sound_Card · Mar 20, 2008

It would be 2xAA per clock on RV670.

EDIT: So I expect a considerable increase in depth/stencil units in RV770.

Jawed · Mar 20, 2008

trinibwoy said:
(I'm pretty sure G9x has a 4x Z advantage per-clock but I'm not sure whether RV670 is capable of 4 or 2 AA samples per-clock.)

4xMSAA per clock, but 8xZ per clock on G9x versus 2x per clock for both MSAA and Z on RV670.
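As per-clock ratios, those figures look like this. The sketch only compares the per-clock rates quoted above; absolute throughput also depends on unit counts and clock speeds, which are not given here, and the dictionary keys are illustrative names.

```python
# Per-clock rate comparison using only the figures quoted above.
# NOTE: absolute throughput also depends on unit counts and clocks, which
# this sketch deliberately leaves out.

g9x = {"msaa_per_clock": 4, "z_per_clock": 8}
rv670 = {"msaa_per_clock": 2, "z_per_clock": 2}

z_advantage = g9x["z_per_clock"] / rv670["z_per_clock"]
msaa_advantage = g9x["msaa_per_clock"] / rv670["msaa_per_clock"]

print(f"G9x Z advantage per clock:    {z_advantage:.0f}x")
print(f"G9x MSAA advantage per clock: {msaa_advantage:.0f}x")
```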

Jawed

Mobius1aic · Mar 21, 2008

In shader benchmarks, weren't the R6xx cards beating their Nvidia rivals? ATI's cards have noticeably fewer TMUs as well, and that'd be my guess as to why they're losing out in real-world conditions.

aaronspink · Mar 22, 2008

Dave Baumann said:
And he was specifically talking about shader bound cases, not average cases.

Specifically, I was referring to Crysis and HL2:E2. Synthetics aren't really that interesting to me atm. Esp synthetics in high visibility benchmarks!

Aaron Spink
speaking for myself inc.

aaronspink · Mar 22, 2008

Mat3 said:
What do you mean 80% of the resources? Are you talking about transistor count or something else?

ALU ops per second.

aaronspink · Mar 22, 2008

Jawed said:
There's nothing there that we haven't already discussed and it still seems very unlikely. It would need to consist of lots of "custom" logic (in the sense of NVidia's ALU design, in order to save die space) for that kind of scaling-up.

The 5:1 ratio sounds right to me (ATI always wants to increase it), but everything else sounds wrong

In my view there's no point increasing the colour fillrate, so RBEs would stay at 16. The key question is Z rate.

Jawed

If there is any part of a graphics design that should be as custom as possible, it's the ALU pipes themselves. It's very, very high bang for the work: basically 1 ALU, then just slicing it across for the SIMD block, and then replication for the multiple SIMD blocks. It's also something that is relatively easy to lay out and extremely regular.

In addition, synthesis tools tend to suck balls at ALU pipes, whereas they are much closer to hand drawn for random logic.

I've been a big proponent of all ALUs being full custom. Esp in light of the way Nvidia did it allowing 2x alu rates.

Aaron Spink
speaking for myself inc.

Farhan · Mar 22, 2008

aaronspink said:
If there is any part of a graphics design that should be as custom as possible, it's the ALU pipes themselves. It's very, very high bang for the work: basically 1 ALU, then just slicing it across for the SIMD block, and then replication for the multiple SIMD blocks. It's also something that is relatively easy to lay out and extremely regular.

In addition, synthesis tools tend to suck balls at ALU pipes, whereas they are much closer to hand drawn for random logic.

I've been a big proponent of all ALUs being full custom. Esp in light of the way Nvidia did it allowing 2x alu rates.

Aaron Spink
speaking for myself inc.

AFAIK the nvidia ALUs aren't full custom. They were still synthesized.

CarstenS · Mar 22, 2008

Is it true that by designing the ALUs themselves (w/o register file and control logic) very lean, you also get to save a lot of mass transistors needed to get a clean, consistent clock signal across the ALUs? So you can reach higher freqs without having to make the whole thing even bigger?

Farhan · Mar 22, 2008

Quasar said:
Is it true that by designing the ALUs themselves (w/o register file and control logic) very lean, you also get to save a lot of mass transistors needed to get a clean, consistent clock signal across the ALUs? So you can reach higher freqs without having to make the whole thing even bigger?

A good full custom design can be much faster (and smaller) than a synthesized design, especially for datapaths, but probably not because of better clock distribution, since that is more of a function of your clock distribution network (and i think tools should be able to balance paths reasonably well, but of course i've never used them before. full custom ftw!). You could argue that making the ALU smaller makes it easier for clock distribution, but on the scale of a single ALU i doubt that is a huge issue unless you are really pushing the limits in terms of clocking.
