That's interesting - so there's more to it than the use of a forwarding network, and those could be real registers after all. I stand corrected. BTW, as a compiler writer I'd love to see the algorithm they are using in the shader compiler for register allocation. Modeling those 'registers' in conventional algorithms is probably not possible, and to use them effectively you probably also need to tweak re-materialization and common sub-expression elimination. [snip]
It's worth noting that an operand need not be consumed on the next "cycle". It persists as long as the lane that produced it is "masked out" (or issued a NOP, erm, not sure now) on later cycles.
It's also worth remembering that the pipeline length of the ALUs is 8, i.e. an operand stored in-pipeline actually persists for multiples of 8 cycles.
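To put that in code form, here's a toy model of the in-pipeline operand idea. The 8-stage depth comes from the above; the recirculate-while-masked rule is just my reading of it, and live_result is a made-up helper, so treat this as a sketch rather than how the hardware actually works.

```python
PIPE_DEPTH = 8  # ALU pipeline length, per the post

def live_result(issues, lane, query_cycle):
    """Most recent unmasked result for `lane` still visible at `query_cycle`.

    `issues` is a list of (cycle, lane, masked, value) tuples in issue order.
    Returns None if nothing for that lane is at the pipeline output right now.
    """
    latest = None
    for cycle, ln, masked, value in issues:
        if ln != lane or cycle > query_cycle:
            continue
        if not masked:
            latest = (cycle, value)  # a new unmasked issue overwrites the old result
    if latest is None:
        return None
    start, value = latest
    # The stored operand only reappears at the pipeline output every
    # PIPE_DEPTH cycles - hence "multiples of 8 cycles".
    return value if (query_cycle - start) % PIPE_DEPTH == 0 else None

# Lane 5 computes a*b at cycle 0, then is masked out on the next two passes:
issues = [(0, 5, False, "a*b"), (8, 5, True, None), (16, 5, True, None)]
print(live_result(issues, 5, 16))  # -> "a*b", still available 16 cycles later
```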
D3D11 has dynamic shader linking - sounds like a nightmare for the compiler writers to me, and I can't help thinking it makes a right old mess of register allocation/thread-spawning. I don't really understand how that's going to work, though.
DSL is an alternative to the use of ubershaders. I am a total OpenGL noob, so I can't make the comparison.
Why should that be a problem? I am no compiler guru, but a compiler would typically convert each module into an intermediate representation and then build everything into one program object. If I understand this correctly, OpenGL has had this since 2.0.
What I think could be problematic is the way that each module is compiled in blissful ignorance of the others. How do you re-use registers across the modules if they're all statically compiled already? Against this is the argument that an ubershader always gets the worst-case register allocation, so in theory the DSL should use fewer registers - but I'm not convinced the combinatorial explosion induced by the modules' "private" register allocations is going to produce a happily co-existing population of registers.
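To put some made-up numbers on that worry (module names, register counts and the liveness figure below are all invented, this is just a sketch): if private allocations can only be stacked at link time, never overlapped, the linked total can come out worse than a jointly-allocated ubershader.

```python
# Registers each module needs when compiled on its own (invented figures).
light = {"point": 4, "spot": 3}
surface = {"phong": 5, "toon": 2}

# Ubershader: every combination compiled as one program. It budgets for the
# worst case, but joint liveness analysis lets register lifetimes overlap.
live_across_call = 2  # assumed: light registers still live in surface code
ubershader = max(max(light.values()),
                 live_across_call + max(surface.values()))  # -> 7

# Dynamic linking: each module keeps its private, pre-compiled allocation,
# so the blocks simply stack:
linked = max(light.values()) + max(surface.values())        # -> 9

print(ubershader, linked)  # 7 vs 9: the private populations co-exist badly
```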
That doesn't rhyme with RV740/40nm "scarcity" on the desktop.
AMD's problem seems to be power, if they still want to build an X2 card. So if power goes up by 30%+ per GPU, X2 is looking dicey.
Assuming the TeraScale architecture is still the architecture of choice for RV870, I would like to see a higher die-space budget for RV870 (even factoring in 40nm). I think they were too conservative with RV770 despite how efficient it is per mm2: with an extra 100mm2 of real estate I think it could have laid a whupping on GT200 while still being smaller (RV770 is ~256mm2, so ~356mm2 against GT200's ~576mm2). Hopefully under 400mm2.
AMD's problem seems to be power, if they still want to build an X2 card. So if power goes up by 30%+ per GPU, X2 is looking dicey.
If AMD's sticking with 16 RBEs at 4xZ per clock, there's also a fundamental question of just how much faster it could go. Let's say it launches at 1GHz: against HD4890's 850MHz that's <18% faster (1000/850 ≈ 1.18).
Why do I get the feeling that RV870, in comparison with GT300, is going to be short of Z-rate, like R600 was at launch?
Jawed
Why not use a similar ROPs/bus ratio as for RV740?
Why do I get the feeling that RV870, in comparison with GT300, is going to be short of Z-rate, like R600 was at launch?
Jawed
I forgot that - ironic really after getting all excited about what RV740 might be. Hmm, yeah, that would be groovy.
Jawed
Only 40% of RV770 is clusters, about 390M transistors. Of that, I estimate 64% is ALUs, excluding the redundant ALU lanes, which I like to lump in with the TUs when looking at a die shot; that makes scaling estimates a bit simpler.
So RV770 has about 250M transistors for its 800 ALU lanes and 140M for the rest of the clusters. RV740's 640 lanes would be ~200M transistors, subject to new functionality and the lack of double precision. The clusters as a whole would be about 312M, leaving, as you observe, a hell of a lot of transistors - 514M - for MCs, RBEs, the hub, PCI Express etc.
This compares with ~566M in RV770.
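For anyone who wants to poke at the numbers, here's the arithmetic laid out as a sketch. Only the two die totals (956M for RV770, 826M for RV740) are published figures; the 40%/64% splits are my estimates from the die shot.

```python
# Back-of-envelope scaling of RV770's transistor budget down to RV740.
rv770_total = 956e6                               # published RV770 count
rv770_clusters = 390e6                            # ~40% of the die is clusters
rv770_alus = 0.64 * rv770_clusters                # ~250M for the 800 ALU lanes
rv770_cluster_rest = rv770_clusters - rv770_alus  # ~140M for TUs etc.

rv740_clusters = rv770_clusters * 640 / 800       # ~312M for 640 lanes
rv740_total = 826e6                               # published RV740 count
rv740_uncore = rv740_total - rv740_clusters       # MCs, RBEs, hub, PCI Express

rv770_uncore = rv770_total - rv770_clusters       # same budget in RV770
print(rv740_uncore / 1e6, rv770_uncore / 1e6)     # -> 514.0 566.0
```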
Analogue stuff isn't supposed to shrink particularly well - dunno how to account for that.
Jawed