AMD: R7xx Speculation

it's more fillrate bound than memory bound ;)

Ignore the highest res, that's just pointless in those graphs. We're still looking at 40% vs 55% scaling when memory isn't a hurdle.

And memory isn't a hurdle because you say so? Double the memory and double the bandwidth certainly has the potential to make scaling more linear through higher resolutions on a variety of games.
 
Where exactly are the issues with scaling? You're being incredibly vague.

The only common reference points between the graphs are the GTX 280 (Tweaktown's is overclocked) and a loose comparison between the 4850 and 4870, single and CrossFire. GTX 280 perf makes sense, and SLI is still scaling better than CF.

WTF are you talking about? Which Vantage scores are you referring to? This is only one game: Crysis. There's nothing inexplicable here.


I already have the Vantage and Crysis scores for a single HD 4870, and looking at those graphs they're just wrong! Is that simple enough?
 
MfA, could you answer something that I asked before?

I'm mostly guessing, but I think there are some benefits to the relaxed timing between successive instructions.
Wider vector widths in software versus the physical width in the hardware lessens the number of stall cycles the hardware experiences.

For example, R600 takes 4 cycles per batch instruction, and it alternates between two batches on each SIMD.

That means a given element's instruction on a SIMD will not hit the N+1 instruction for 8 cycles.
If branches can be resolved and the next instruction picked in that time frame, branching won't insert a bubble.
Also, result forwarding isn't needed if the results can make the round trip from ALU to register file, and back again, which saves hardware.

Register operands can be loaded without dependence checking, which makes instruction issue easier and more hardware-streamlined.

If batch size matched the hardware, a dependent instruction could stall.
There's also more work to be done by the scheduler, since it can't take 8 cycles to prep the next instruction.
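
To make the cycle counting concrete, here's a rough Python sketch of that alternating schedule. The 4-cycle/2-batch figures are just the ones described above, nothing measured:

Code:
# One R600-style SIMD alternating between two batches A and B,
# 4 cycles per batch instruction, as described above.
CYCLES_PER_INSTR = 4
schedule = []
for instr in range(3):                  # instructions N, N+1, N+2
    for batch in ("A", "B"):            # alternate between the two batches
        schedule.extend([(batch, instr)] * CYCLES_PER_INSTR)

starts_for_A = [schedule.index(("A", n)) for n in range(3)]
print(starts_for_A)   # [0, 8, 16] -> 8 cycles between N and N+1 for a given batch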
 
Knowledgeable? Not really ... I just like making guesses. On one hand I'd say it shouldn't really matter; on the other, both NVIDIA and ATI make their warps/wavefronts larger than the width of their architecture (even including double pumping on NVIDIA's side). So I'm obviously missing something.
Maybe it's just the obvious answers, then.

Smaller warps means you need to manage more of them simultaneously to do the same latency hiding. Fewer cycles between warp switches means you need more instruction/constant bandwidth and more flexible register file multiplexing/routing.
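
A toy calculation to show the shape of that trade-off (the latency figure and widths are assumptions, and it pretends each waiting warp contributes a single instruction):

Code:
# Toy model: hide LATENCY cycles on a 16-lane SIMD by switching between warps,
# one instruction per warp while waiting.  Numbers are assumptions, not specs.
LATENCY = 200
SIMD_WIDTH = 16
for warp_size in (64, 32, 16):
    cycles_per_issue = warp_size // SIMD_WIDTH       # cycles one warp instruction occupies the SIMD
    warps_needed = -(-LATENCY // cycles_per_issue)   # ceil(LATENCY / cycles_per_issue)
    fetches_per_cycle = 1.0 / cycles_per_issue       # instruction/constant fetch rate
    print(warp_size, cycles_per_issue, warps_needed, fetches_per_cycle)
# 64 -> 4 cycles/issue, 50 warps in flight, 0.25 fetches/cycle
# 32 -> 2 cycles/issue, 100 warps, 0.5 fetches/cycle
# 16 -> 1 cycle/issue, 200 warps, 1.0 fetches/cycle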

For everything else in the SIMDs, such as the arithmetic pipeline etc, do you think it's safe to say that per-clock instruction switching isn't a big cost?
 
I'm mostly guessing, but I think there are some benefits to the relaxed timing between successive instructions.
Wider vector widths in software versus the physical width in the hardware lessens the number of stall cycles the hardware experiences.
I guess. I sort of thought of it the other way, too, i.e. if you're going to do 16 element batches, isn't it cheaper to make one 16-wide SIMD instead of two 8-wide SIMDs?

For example, R600 takes 4 cycles per batch instruction, and it alternates between two batches on each SIMD.

That means a given element's instruction on a SIMD will not hit the N+1 instruction for 8 cycles.
Sure, but the same would happen if it took 1 cycle and sequentially cycled between 8 batches.

One thing I wondered is why R600 (and later) didn't have a 64x1D single-clock-switch SIMD instead of 16x5x1D 4-clock-switch SIMD, if you know what I mean. Let's ignore the 5th channel for now. This would give ATI the same dependency-free scalar performance that NVidia has.

Both have pretty much the same instruction bandwidth (1D per clock vs. 4x1D every 4 clocks) and work on the same batch size. There's a little difference in scheduling cost for the same reason. Those 8 cycling batches in this design could be loaded in every 32 cycles since that's how long it would take to run the same number of instructions as the old design (a pair of batches every 8 cycles).
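
Checking that arithmetic with toy numbers (5th channel ignored as above; these are just the figures from the post, not measurements):

Code:
# Scalar throughput and average instruction-fetch rate for the two layouts,
# both working on 64-element batches (ignoring the 5th channel).
# a) 16 lanes, 4x1D issued together, 4 clocks per batch instruction group
a_results_per_clk = 16 * 4        # 64 scalar results per clock
a_fetch_per_clk = 4 / 4           # 4 scalar instructions per 4 clocks
# b) hypothetical 64 lanes x 1D, a new instruction every clock
b_results_per_clk = 64 * 1
b_fetch_per_clk = 1 / 1
print(a_results_per_clk, b_results_per_clk)   # 64 64   -> same ALU throughput
print(a_fetch_per_clk, b_fetch_per_clk)       # 1.0 1.0 -> same average instruction bandwidth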

Am I making sense?
 
I already have the Vantage and Crysis scores for a single HD 4870, and looking at those graphs they're just wrong! Is that simple enough?
Sure, but that rationale is a little different than what you were trying to do with the Tweaktown comparison.

So what are you saying now? You have another source for HD 4870 numbers in Crysis and they don't match those graphs?
 
It's a great chip compared to RSX, that's for sure (at least in terms of its design and functionality), but that's more because RSX was pretty poor for its timeframe.

What I mean is, Xenos is clearly a great design on paper, and it also comes packed with great functionality but the same can be said of R600. We mark R600's "greatness" down because it didn't perform as well as we expected. I'm just not seeing why we should assume Xenos is a superior implementation of the architecture when R600 came second and had time to learn from and refine the Xenos design.

E.g. in terms of overall efficiency of the implementation it looks like:

R600 -> R670 -> R770

That also matches the timing of their releases, which is to be expected, as each evolved from the one before. Xenos performance is an unknown, but timing-wise it does slot into the above picture before R600, so if we're going to make assumptions about its efficiency it seems more sensible that those assumptions fit into that picture. Assuming Xenos is as efficient an implementation of that basic architecture as R770 seems a bit baseless to me. More likely it's an implementation that's about as efficient as R600, or less so.

Xenos is not as related to R600 as you might think at first glance; the team compositions for the two projects were somewhat different, and R600 is quite different in practice from Xenos.
 
Sure, but that rationale is a little different than what you were trying to do with the Tweaktown comparison.

So what are you saying now? You have another source for HD 4870 numbers in Crysis and they don't match those graphs?

Well, I was giving an example of the scaling at the different resolutions, but in any case they are off. The HD 4870 gets around X3500-X3600 in Vantage, and is around 20% (give or take 5%, depending on the level) faster in Crysis than a 9800 GTX; it gets close to a GTX 260 but doesn't beat it.
 
I already have the Vantage and Crysis scores for a single HD 4870, and looking at those graphs they're just wrong! Is that simple enough?

I also have the Vantage scores of a single HD4870. Heck, I even have one of those damn thingies running in my comp right now... And compared to a similar system (albeit that one has a faster CPU) it manages to beat the GF9800GTX by a very healthy margin (not using the cheat-drivers) and in the feature tests it is almost on par (FT3) or even manages to beat the GTX 260 (FT1, FT2, FT5, FT6). HD4870 is fast.
 
Casting Nvidia's support for hardware acceleration of the PhysX API as a "cheat" seems somewhat off center to me. If FM wanted a CPU physics test they should've designed one. But what do I know.....

How are the noise/temps on the 4870 shaping up so far? If those are in check it'll probably render the GTX+ inconsequential.
 
And memory isn't a hurdle because you say so? Double the memory and double the bandwidth certainly has the potential to make scaling more linear through higher resolutions on a variety of games.


I'm sorry, but if AA isn't being used, bandwidth isn't even a factor in Crysis with these cards.
 
Casting Nvidia's support for hardware acceleration of the PhysX API as a "cheat" seems somewhat off center to me. If FM wanted a CPU physics test they should've designed one. But what do I know.....

How are the noise/temps on the 4870 shaping up so far? If those are in check it'll probably render the GTX+ inconsequential.

I'd imagine FM wanted PhysX-owners to get some boost, not nVidia card users since the score boost they get doesn't represent reality in any way, as the card isn't stressed for gfx at the same time as physics.
 
I guess. I sort of thought of it the other way, too, i.e. if you're going to do 16 element batches, isn't it cheaper to make one 16-wide SIMD instead of two 8-wide SIMDs?
A single wide SIMD amortizes a fair amount of control and scheduling hardware over more ALUs, yes.
Though if the batch size is 16 we're back at a SIMD that runs through a batch instruction in one cycle.

Sure, but the same would happen if it took 1 cycle and sequentially cycled between 8 batches.
That would quadruple the number of thread contexts that the SIMD sequencers would have to pick through in order to set up the SIMD's execution schedule.
Similarly, there would be four times as many instruction queues.

Whatever storage that holds the instructions for an ALU clause would be accessed every cycle, as opposed to twice every 8 cycles.
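
Putting rough numbers on that, a toy comparison over an 8-cycle window, following the figures above:

Code:
# Keep one SIMD busy for an 8-cycle window either way.
WINDOW = 8
# a) 4 cycles per batch instruction, alternating between 2 batches
a_contexts = 2
a_clause_accesses = WINDOW // 4     # 2 instruction fetches per window
# b) 1 cycle per batch instruction, cycling through 8 batches
b_contexts = 8
b_clause_accesses = WINDOW // 1     # 8 fetches per window
print(b_contexts // a_contexts)               # 4 -> four times the contexts to pick through
print(a_clause_accesses, b_clause_accesses)   # 2 vs 8 accesses to the clause storage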

One thing I wondered is why R600 (and later) didn't have a 64x1D single-clock-switch SIMD instead of 16x5x1D 4-clock-switch SIMD, if you know what I mean. Let's ignore the 5th channel for now. This would give ATI the same dependency-free scalar performance that NVidia has.
It would require four times the branch units to resolve branches, and a complex operation like a transcendental or integer multiply would require multiplying the number of transcendental units by a factor of four to keep that throughput equivalent.

Both have pretty much the same instruction bandwidth (1D per clock vs. 4x1D every 4 clocks) and work on the same batch size. There's a little difference in scheduling cost for the same reason.
There's more queueing going on with the more finely divided scheduling.
The act of decoding instructions also happens much more frequently.
4x1D only decodes once every four clocks, while the other option decodes every clock.
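
The same kind of back-of-the-envelope for the branch/transcendental and decode point (toy numbers again, nothing from a spec):

Code:
# Units needed per clock to retire one 64-element batch instruction before the
# next one issues, and how often the decoder runs.
BATCH = 64
for name, clocks_per_instr in (("16-wide, 4-clock", 4), ("64-wide, 1-clock", 1)):
    units_per_clk = BATCH // clocks_per_instr    # branch evals or transcendentals per clock
    decodes_per_clk = 1.0 / clocks_per_instr
    print(name, units_per_clk, decodes_per_clk)
# 16-wide, 4-clock: 16 units/clk, decode every 4th clock
# 64-wide, 1-clock: 64 units/clk, decode every clock -> 4x the units, 4x the decode rate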

It also appears that the single lane layout in G80 was a stumbling block to getting higher DP FLOPs, or AMD just lucked out that its scheme allowed for a quicker path to DP math.

I'm not sure which scheme is better.
The wider AMD model takes a given number of instructions and runs them over a much wider number of elements, so the costs of the program itself and the setup are cheaper per element.
On the other hand, it is more wasteful when the workload isn't as broad.

The more narrow model takes more hardware to schedule, and fewer elements means that the overheads per batch are spread out over a smaller number of elements. There's much less waste at the margins, though.
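
For the "waste at the margins" part, a made-up example (150 elements of real work is an arbitrary number, not from any benchmark):

Code:
# Pad a small workload up to whole batches and count the idle lanes.
import math
work = 150                                   # elements of real work (arbitrary)
for batch_size in (64, 32):
    batches = math.ceil(work / batch_size)
    issued = batches * batch_size
    print(batch_size, batches, issued, issued - work)
# 64-wide batches: 3 batches, 192 slots, 42 idle
# 32-wide batches: 5 batches, 160 slots, 10 idle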
 
I'd imagine FM wanted PhysX-owners to get some boost, not nVidia card users since the score boost they get doesn't represent reality in any way, as the card isn't stressed for gfx at the same time as physics.

If they wanted that they should've designed a combined CPU/GPU/Physics/AI test. Having separate tests and summing them together based on some arbitrary weighting wouldn't tell you anything about how a separate PhysX PPU would help in games either.

The point is that FM designed a standalone PhysX test. It was their decision to do this and then arbitrarily factor the result into the final score. Of course this was done based on the assumption that PhysX would run either on a CPU or PPU but now that it's GPU accelerated that is FM's miscalculation not Nvidia "cheating". I find the whole notion to be absurd.
 
If they wanted that they should've designed a combined CPU/GPU/Physics/AI test. Having separate tests and summing them together based on some arbitrary weighting wouldn't tell you anything about how a separate PhysX PPU would help in games either.

The point is that FM designed a standalone PhysX test. It was their decision to do this and then arbitrarily factor the result into the final score. Of course this was done based on the assumption that PhysX would run either on a CPU or PPU but now that it's GPU accelerated that is FM's miscalculation not Nvidia "cheating". I find the whole notion to be absurd.

By FM's rules, the results are not valid. Comparing GPU PhysX-enabled results to non-PhysX results becomes a meaningless comparison. What is unclear about that?

Your whole problem seems to be with the word cheating. Call it what you want but it makes for useless comparisons if reviewers choose to use those results.
 
Your whole problem seems to be with the word cheating. Call it what you want but it makes for useless comparisons if reviewers choose to use those results.

Yeah I do have a problem with the word. Especially coming from people who should be a bit more insightful.

My point is that the comparison would be useless with a PPU as well. Or an 8-core CPU. The combined 3dmark score was useless from the beginning. This just makes it even more so. The whole "Nvidia is cheating again" mantra is very catchy but this time it seems to be bred from ignorance more than anything else.

In any case, how would Nvidia avoid "cheating" in this case? Doesn't 3dmark just make calls to the PhysX API?
 