AMD: R7xx Speculation

Status
Not open for further replies.
how much of deficits, the 8800 GT is also in that graph and it has ~35% less bandwidth compared to the 8800 GTX, and its performance is only 5% less at the highest resolution tested. To expect miracles with the 4870 just because of bandwidth in Crysis, I just can't seem to see where that is coming from.
What miracles? The graph you called fake has the 4870 only 21% faster than 4850 in the Tweaktown review you linked. The 4870 CF score is only 27% higher than the 4850.

Only the 2560x1600 scores are messed up, but there the card is running out of memory and all bets on reproducibility are off. TweakTown even said they're using a custom demo.

What do you think people are expecting from the 4870? Most are thinking 35% at best. The only BW scaling needed for that is 12.5% more perf with 50% more BW. Is that so hard to believe? We see similar things with G80 vs G92.
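A rough sanity check of that arithmetic, assuming the bandwidth-related gain and the rest of the speedup simply multiply. The variable names are only for illustration, and attributing the remaining factor to anything specific (core clock, for example) is an assumption that the post does not state.

```python
# Figures quoted in the post above.
target_total_gain = 0.35   # "35% at best"
bandwidth_perf_gain = 0.125  # "12.5% more perf ..."
bandwidth_increase = 0.50    # "... with 50% more BW"

# Assumption: total speedup = (bandwidth-related factor) * (everything else).
implied_other_factor = (1 + target_total_gain) / (1 + bandwidth_perf_gain)

# How much performance is assumed to follow each unit of extra bandwidth.
bandwidth_sensitivity = bandwidth_perf_gain / bandwidth_increase

print(f"speedup needed from everything else: {implied_other_factor:.2f}x")
print(f"perf gain per unit of bandwidth gain: {bandwidth_sensitivity:.2f}")
```

So the claim only needs a quarter of the extra bandwidth to show up as performance, with the rest of the gap coming from elsewhere.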
 
Well, that's why I stated something is wrong with the scaling of those graphs. I didn't say it should be less or more, did I? I also noted that the single card scores are off. And that goes back to Morgoth's idiotic commentary.....

Well, excluding the 2x res, that's of course memory limited.
 
Assuming two register operands are allowed to be read per instruction, A's SIMD reads 10 groups of 16 FP32s per clock. In B, the SIMD reads 2 blocks of 64 FP32s every clock and 2 more every 4 clocks for the transcendental units. "Live registers" is the same.
Just by the size of the SIMD read for design B, there might be a physical split anyway, though this would be code-transparent.
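The read rates quoted above can be compared directly. This is only a sketch of the arithmetic, taking the post's figures at face value and averaging design B's transcendental reads over the 4 clocks they are spread across.

```python
from fractions import Fraction

# Design A: 10 groups of 16 FP32 operands read every clock.
reads_a = Fraction(10 * 16)

# Design B: 2 blocks of 64 FP32 operands every clock, plus 2 more
# blocks of 64 every 4 clocks for the transcendental units.
reads_b = Fraction(2 * 64) + Fraction(2 * 64, 4)

print(f"design A: {reads_a} FP32 reads per clock")
print(f"design B: {reads_b} FP32 reads per clock (averaged)")
```

Under those assumptions both designs average the same number of operand reads per clock; only the grouping differs.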

Yes.
Otherwise I can't make the same claims as before. I guess it could be 4 groups of 16, but isn't that the same thing?
It's just that if 64 elements are being evaluated per clock, eventually we reach a point where 64 conditional instructions pop up in a single clock.
I guess there could be a way to spread that out over multiple clocks, if the number of branch units is kept at 16.

For clarity, how big is a clause in the way you use the term? Are you talking about a 5x1D instruction packet, or something variable and longer? (Well, not always 5x1D, as tex or branch instructions are possible too, but they don't go into the SIMDs, obviously)
Each instruction packet would take up one instruction slot, and a clause is defined as a collection of homogeneous instructions, which for R6xx was something in the neighborhood of up to 120 slots.
I can't find the document I had that listed the nuances of it.
It's basically a collection of similar type instructions (tex, vertex, or ALU) that is supposed to execute without preemption.
 
Well, that's why I stated something is wrong with the scaling of those graphs. I didn't say it should be less or more, did I? I also noted that the single card scores are off. And that goes back to Morgoth's idiotic commentary.....

Well, excluding the 2x res, that's of course memory limited.

Still an annoyed little child, sigh. 25% at most?
 
Just by the size of the SIMD read for design B, there might be a physical split anyway, though this would be code-transparent.
Okay, but it's the same size as the SIMD in design A. Looking at the RV770 layout, it does look like 4 groups of 4x(4x1D+1D), but you can do the same in B, i.e. 4 groups of 16x1D+4x1D.

It's just that if 64 elements are being evaluated per clock, eventually we reach a point where 64 conditional instructions pop up in a single clock.
Why does that matter? The batch size for conditional granularity is 64 elements anyway for ATI since R600. As Jawed said, the sequencer takes care of that, and it's done in the same way.

Are you worried about all 8 batches in design B hitting a conditional one after the other? That can just as easily happen with 4 sequential pairs of batches in Design A as well over the same 32 clocks.

Each instruction packet would take up one instruction slot, and a clause is defined as a collection of homogeneous instructions, which for R6xx was something in the neighborhood of up to 120 slots.
I can't find the document I had that listed the nuances of it.
It's basically a collection of similar type instructions (tex, vertex, or ALU) that is supposed to execute without preemption.
Yup, I started reading the document linked above.

Okay, so we need 8 clauses in flight instead of 2. Big deal. The max. rate at which batches (and thus clauses) go into the SIMD is the same. I was simplifying things with my steady "macrobatch" explanation, but accommodating variable length clauses doesn't make it much different. In design B, the sequencer just marches through the 8 active batches - one every 4 clocks - to see if the clause and/or batch needs to be replaced and does so. In design A, it did that at the same rate, but only looked after two at a time. Determining which batch to go next is identical.

Clause throughput isn't any faster, nor is the fetching of "instruction groups" (which is what I called a packet).
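A toy model of that argument, assuming a plain round-robin sequencer that looks at one active batch every 4 clocks. The function and constant names are hypothetical, and the model ignores batch replacement and variable clause lengths entirely.

```python
CLOCKS_PER_VISIT = 4  # from the post: one batch is looked at every 4 clocks


def visit_schedule(active_batches, total_clocks):
    """Return (clock, batch_index) pairs for a round-robin sequencer."""
    return [
        (clock, (clock // CLOCKS_PER_VISIT) % active_batches)
        for clock in range(0, total_clocks, CLOCKS_PER_VISIT)
    ]


# Design A keeps 2 batches active, design B keeps 8 (figures from the post).
design_a = visit_schedule(active_batches=2, total_clocks=32)
design_b = visit_schedule(active_batches=8, total_clocks=32)

for name, schedule in (("A", design_a), ("B", design_b)):
    order = [batch for _, batch in schedule]
    print(f"design {name}: {len(schedule)} visits in 32 clocks, order {order}")
```

In this simplified picture both designs make the same number of sequencer decisions over 32 clocks; design B just cycles through more batches before coming back to the first one.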
 
Well, that's why I stated something is wrong with the scaling of those graphs. I didn't say it should be less or more, did I?
I still don't see what's wrong. The highest resolution has memory issues on Tweaktown, and the lowest isn't even the same resolution (1680x1050 vs. 1280x1024). What scaling are you objecting to? Which pairs of numbers don't scale like you expect?
 
Still an annoyed little child, sigh. 25% at most?


In Crysis without AA, yes. And if you still want to argue with it, get the game and test it out. I've been using this engine for the past 4 months, and I'm pretty comfortable with my numbers.

Mint:

At 1920x1200 there is 47% scaling for the HD 4850 CrossFire in the TweakTown review.
At the same res there is 55% scaling for the HD 4870 CrossFire in the chart we are talking about.

A) If there were a bandwidth bottleneck, the HD 4850 in CrossFire would scale better because the bottleneck would be relieved, unless in CrossFire there is still enough of a bottleneck on the HD 4870 to accommodate the % difference. I find it very hard to believe that with 280 mb/sec there is a bottleneck that severe.
B) Scaling is just better with the HD 4870 even though they are the same chips.
C) Drivers.
D) Performance figures for one card are off.

A has to be true, since there is no way the bandwidth of a CrossFired HD 4870 would cause that.
B: the only way this is true is if C is the culprit, which is unlikely since the drivers we are seeing are available and no major performance changes have been noted.
D: well, they are off, even for some of the GFs. Last I remember, in DX9 the GX2 actually had a marginal lead over the 280.
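For reference, the scaling percentages being argued over can be cross-checked against the figures quoted earlier in the thread (the 4870 being 21% faster than the 4850 as a single card). This sketch assumes all of those percentages refer to the same resolution and test, which the posts do not confirm; the fps values passed to `scaling` are made up purely to show the definition.

```python
def scaling(single_fps, dual_fps):
    """Fractional gain of a two-card setup over a single card."""
    return dual_fps / single_fps - 1.0


# Hypothetical frame rates, only to illustrate what "47% scaling" means.
example = scaling(single_fps=30.0, dual_fps=44.1)

# Percentages quoted in the thread.
scaling_4850 = 0.47      # HD 4850 CrossFire, TweakTown review
scaling_4870 = 0.55      # HD 4870 CrossFire, disputed chart
single_card_lead = 0.21  # single 4870 over single 4850

# Implied lead of 4870 CrossFire over 4850 CrossFire.
cf_lead = (1 + single_card_lead) * (1 + scaling_4870) / (1 + scaling_4850) - 1

print(f"example scaling: {example:.0%}")
print(f"implied 4870 CF lead over 4850 CF: {cf_lead:.1%}")
```

The implied CrossFire lead comes out a little under 28%, so the quoted 47% and 55% scaling figures are at least arithmetically consistent with the single-card and CrossFire gaps mentioned earlier.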
 
I would caution you to get your head round sections 4.6 and 4.7 of R600 ISA :devilish:
I took a peek, but it's mostly about, well, the ISA. :p

Thinking about it more, I was probably wrong about the GPR simplification with design B, because design A doesn't necessarily have to fetch in quarter-sized groups if it pipelines the register file accesses. Nonetheless, there isn't anything making design B more demanding or complicated in terms of register file access and porting.
 
Mint:

At 1920x1200 there is 47% scaling for the HD 4850 CrossFire in the TweakTown review.
At the same res there is 55% scaling for the HD 4870 CrossFire in the chart we are talking about.
OMFG, is that it? :LOL:

That wee difference can be explained simply by a different friggin motherboard! Different timedemos (changing the PCI-e and/or CF-link load) would also give you that discrepancy. Notice that the TweakTown review writes: "Timedemo or Level Used: Custom time demo"
 
That's not it; Crysis doesn't scale that well on the Radeons :smile:. Average scaling is actually lower than with the time demo used in the TweakTown benchmark, and that is with the HD 4850 in CrossFire. It is level dependent, but scaling of 45% is the peak without AA; that actually goes for SLI too.
 
That's not it; Crysis doesn't scale that well on the Radeons :smile:. Average scaling is actually lower than with the time demo used in the TweakTown benchmark, and that is with the HD 4850 in CrossFire. It is level dependent, but scaling of 45% is the peak.
So what are you saying now? The TweakTown benchmarks are also fake? :LOL:

I'm not wasting any more time with you. I'll bet you $100 that there will be plenty of reviews out there showing the 4870 faster than the GTX 260 without AA in Crysis and other games. Maybe not all, but who cares?
 
So what are you saying now? The TweakTown benchmarks are also fake? :LOL:

I'm not wasting any more time with you. I'll bet you $100 that there will be plenty of reviews out there showing the 4870 faster than the GTX 260 without AA in Crysis and other games. Maybe not all, but who cares?

Why without AA necessarily? Unless we assume VRAM limited scenarios...which will be dealt with:)
 
On my display there's absolutely no benefit with 8xMSAA or any of the CSAA modes. 4xMSAA + supersampled transparencies has been my sweet spot at 1680x1050 on a 20". The popular 1920x1200 24 inchers have a higher dot pitch though so more AA may be beneficial there. Higher contrast displays might also benefit from more filtering.
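The dot pitch comparison is easy to check. A minimal sketch, assuming the panels are exactly 20" and 24" on the diagonal with square pixels (real panels are often sold with rounded sizes):

```python
import math


def pixel_pitch_mm(diagonal_inches, width_px, height_px):
    """Pixel pitch in mm for a panel with square pixels."""
    diagonal_px = math.hypot(width_px, height_px)
    return diagonal_inches * 25.4 / diagonal_px


pitch_20 = pixel_pitch_mm(20.0, 1680, 1050)
pitch_24 = pixel_pitch_mm(24.0, 1920, 1200)

print(f'20" 1680x1050: {pitch_20:.3f} mm')
print(f'24" 1920x1200: {pitch_24:.3f} mm')
```

Under those assumptions the 24" panel does have the coarser pitch, by roughly 5%.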

In light of that I want to perform some tests on the various Adaptive AA modes (TSAA) of the 4850. Unfortunately, neither World in Conflict nor HL2:Ep2 seem to be good test cases for this (I'll post screenshots in a jiffy to show my assertion).

What would be a good game to test and in what location? Preferably something that comes in downloadable demo form (unless I happen to own the game lol).
 
In light of that I want to perform some tests on the various Adaptive AA modes (TSAA) of the 4850. Unfortunately, neither World in Conflict nor HL2:Ep2 seem to be good test cases for this (I'll post screenshots in a jiffy to show my assertion).

What would be a good game to test and in what location? Preferably something that comes in downloadable demo form (unless I happen to own the game lol).
HL2 should be a good test of adaptive AA. Look for chain-link fences, I recall those as being much-improved.
 