Quote:
Originally Posted by Mintmaster
(Post 983003)
I don't know how I can explain it any simpler than I did in the last post. For the last time, DC is not talking about a vector instruction on R300 vs. a vector instruction on R600. He is simply pointing out that if ADD co-issue wasn't done on R300, no big deal. If scalar co-issue isn't done on R600, it is a big deal. He's talking about extracting the max throughput possible. Shader code is more likely to have lots of scalar and vec2 code than a crapload of ADDs.
|
Which will run at lower utilisation on R300 than on R600 if no co-issue is possible (due to dependency). Further, for scalar or vec2 instructions, any co-issue that can be identified by the R300 compiler is going to work on R600. So, again, R600 comes out better than R300.
Quote:
ATI obviously put a lot of work and die area into making R600 a 5x1D architecture, clearly for the purpose of improving speed more than could be done by extra Vec4+1D units instead. Not getting co-issue right in the compiler for R600 would be a lot more damaging (especially in comparison to G80) than with R300.
|
There's a lot of low-hanging fruit in co-issue for this architecture, though. As I keep saying, vec3/vec4 instructions make a mockery of the suggestion that ATI would be struggling with a compiler that can co-issue in the most basic cases. And that basic case makes up a lot of code.
Unravelling dependencies and eliminating dead code amongst vector instructions is what the set of static single assignment patent applications is all about. Sure, they make DemoCoder sneer at their obviousness - but they lie at the heart of making the compiler do the non-trivial co-issues that you guys are so desperate to show are impossibly complicated and bound to make R600 fall on its own sword.
I've got no argument that there are corner cases of tightly-dependent code that will run like a dog on R600 and I'm under no illusion that co-issue is generally trivial. The devrel guys are always begging gamedevs to explicitly mask their outputs - so that should be clue enough that the compiler guys have a hard time...
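To make the "basic case" versus "tightly-dependent corner case" distinction concrete, here is a toy sketch of co-issue packing. It is only an illustration under my own assumptions: a bundle has five scalar lanes, an op occupies as many lanes as components it writes, and two ops in the same bundle must not depend on each other. The `Op`, `pack` and `utilisation` names are made up for this post and say nothing about how ATI's compiler actually works.

```python
from dataclasses import dataclass

SLOTS = 5  # assumption: five scalar lanes per bundle (the 5x1D idea in this thread)


@dataclass(frozen=True)
class Op:
    name: str
    width: int          # number of components written (1 = scalar, 4 = vec4)
    deps: tuple = ()    # names of ops whose results this op reads


def pack(ops, slots=SLOTS):
    """Greedy in-order packing: fill each bundle with ready ops that still fit."""
    done = set()            # names of ops issued in earlier bundles
    pending = list(ops)
    bundles = []
    while pending:
        bundle, used = [], 0
        for op in pending:
            ready = all(d in done for d in op.deps)
            if ready and used + op.width <= slots:
                bundle.append(op)
                used += op.width
        if not bundle:
            raise ValueError("op too wide for a bundle, or dependency never satisfied")
        issued = {op.name for op in bundle}
        done |= issued
        pending = [op for op in pending if op.name not in issued]
        bundles.append(bundle)
    return bundles


def utilisation(bundles, slots=SLOTS):
    """Fraction of lanes doing useful work across all bundles."""
    used = sum(op.width for b in bundles for op in b)
    return used / (slots * len(bundles))


# Hypothetical shader fragment: a vec3 and a scalar that are independent,
# followed by a chain of dependent vec2/scalar ops.
ops = [
    Op("a", 3),
    Op("b", 1),
    Op("c", 2, ("a",)),
    Op("d", 1, ("b",)),
    Op("e", 1, ("c",)),
    Op("f", 1, ("e",)),
]
bundles = pack(ops)
names = [[op.name for op in b] for b in bundles]
print(names, utilisation(bundles))
```

In this made-up fragment the vec3 + scalar and vec2 + scalar pairs co-issue without any cleverness, while the dependent scalar tail (e, then f) leaves most lanes idle. That tail is the kind of code where a smarter compiler or explicitly masked outputs would have to find the extra work.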
Quote:
Oh, okay.
So basically you're suggesting the same rationale that I did previously. The batch size is too small for one quarter of the ALUs to all operate on the same channel like in G80.
|
No, my suggestion is that, having a 4-way tiled architecture, they wouldn't then want to split up each of the 4 tiles of batch scheduling with a more advanced sequencer. The sequencer from R5xx is enough (one ALU batch, one texture batch - roughly speaking) instead of using the Xenos sequencer (multiple ALU batches and one texture batch - roughly speaking).
Quote:
I still think that's the way to go. Just use a 64-pixel batch size. It won't make that much difference, and I certainly think it would be much less than from near-perfect utilization.
|
But now you have spent more die space on the sequencers to get the same batch size. The payback is that two consecutive and dependent scalar or vec2 instructions will run at full speed. Maybe the payback isn't worth it?
---
Last night I realised that R600's ALU organisation offers another fundamental advantage (a direct inheritance from Xenos, prolly). Since each of the four shader units contains five 16-way ALUs, ATI's fine-grained ALU-redundancy scheme works a charm. In this setup, a 17th ALU pipeline is added to each array, so the redundancy overhead is about 6% (one spare per 16 working pipelines). (The theory is that each of Xenos's three 16-way ALUs has a 17th pipeline for redundancy.)
If R600 was implemented as lots of smaller ALUs (e.g. to build a sequential scalar GPU like G80) then the redundancy overhead would be significantly higher.
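The arithmetic behind that, as a quick sketch. The only inputs taken from the above are four shader units, five 16-way arrays each, and one spare pipeline per array; the 8-wide and 4-wide rows are hypothetical widths picked purely for comparison, not a claim about any real chip.

```python
def redundancy_overhead(lanes, spares=1):
    """Extra pipelines as a fraction of the working pipelines in one array."""
    return spares / lanes


# R600 as described above: 4 shader units x 5 arrays, 16 lanes per array.
r600_arrays = 4 * 5
r600_lanes = r600_arrays * 16
r600_spares = r600_arrays * 1   # one 17th pipeline per array

print(f"R600: {r600_lanes} lanes + {r600_spares} spares "
      f"= {redundancy_overhead(16):.2%} overhead")

# Hypothetical narrower arrays with the same one-spare-per-array scheme.
for width in (8, 4):
    print(f"{width}-wide arrays: {redundancy_overhead(width):.2%} overhead")
```

With one spare per array the overhead is simply 1/width, so halving the array width doubles the cost of the same redundancy scheme.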
Again, another indication of evolution...
Jawed