AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 GPU lineup?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

  Total voters: 155 (poll closed)
[snip]

It's worth noting that an operand need not be consumed on the next "cycle". It persists as long as the lane that produces the result is "masked out" (or NOP'd, erm, not sure now) on later cycles.

It's also worth remembering that the pipeline length of the ALUs is 8, i.e. an operand stored in-pipeline actually persists for multiples of 8 cycles.
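One way to read those two claims together (purely my own toy model, not documented RV7xx behaviour): a value written into an 8-deep ALU pipeline re-emerges every 8 cycles, so it stays consumable at any later cycle that is a multiple of 8 after it was produced, provided the producing lane is masked out in between.

```python
# Toy model of in-pipeline operand persistence -- an assumption on my
# part, not a description of the real hardware. With an 8-deep ALU
# pipeline, a result produced on cycle t is readable again on
# t+8, t+16, ... for as long as the producing lane stays masked out.
PIPELINE_DEPTH = 8

def consumable_cycles(produced_at, horizon):
    """Cycles up to `horizon` on which the in-flight operand can be read."""
    return [c for c in range(produced_at, horizon)
            if (c - produced_at) % PIPELINE_DEPTH == 0]

# e.g. a value produced on cycle 3 is readable on cycles 3, 11, 19, 27
# within a 30-cycle window.
window = consumable_cycles(3, 30)
```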
That's interesting, so there's more to it than the use of a forwarding network, and those could be real registers after all. I stand corrected :) BTW, as a compiler writer I'd love to see the algorithm they are using in the shader compiler for register allocation. Modeling those 'registers' in conventional algorithms is probably not possible, and to use them effectively you'd probably also need to tweak rematerialization and common-subexpression elimination.
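For flavour, here is a minimal sketch of the textbook linear-scan allocator over live intervals - emphatically not ATI's algorithm (which isn't public), just the conventional baseline that those in-pipeline 'registers' would complicate, since an interval would additionally have to end at the right cycle offset.

```python
# Minimal linear-scan register allocation over live intervals.
# Illustrative only; a real shader compiler has many more constraints.

def linear_scan(intervals, num_regs):
    """intervals: list of (name, start, end). Returns ({name: reg}, spills)."""
    allocation = {}
    spills = []
    active = []                      # (end, name) currently holding a register
    free = list(range(num_regs))
    for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
        # Expire intervals that ended before this one starts, freeing regs.
        for e, n in list(active):
            if e < start:
                active.remove((e, n))
                free.append(allocation[n])
        if free:
            allocation[name] = free.pop()
            active.append((end, name))
        else:
            spills.append(name)      # no register left: spill to memory
    return allocation, spills

# 'b' dies before 'c' starts, so they can share one of the two registers.
regs, spilled = linear_scan(
    [("a", 0, 4), ("b", 1, 2), ("c", 3, 5)], num_regs=2)
```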
 
I don't know much about compilation, but here's some stuff:

http://v3.espacenet.com/publication...005166194A1&DB=EPODOC&locale=en_GB&CC=US&FT=D

http://v3.espacenet.com/publication...005198468A1&DB=EPODOC&locale=en_GB&CC=US&FT=D

http://v3.espacenet.com/publication...006070050A1&DB=EPODOC&locale=en_GB&CC=US&FT=D

http://ati.amd.com/developer/cgo/2008/Rubin-CGO2008.pdf

I've had some grief with the compiler:

http://forum.beyond3d.com/showpost.php?p=1286449&postcount=652

where register allocation has gaping holes, meaning most allocated registers are never used.

What do you think of using the parallel brute force of GPUs for compilation? Any possible gain?

D3D11 has dynamic shader linking - sounds like a nightmare for the compiler writers to me and I can't help thinking it makes a right old mess of register allocation/thread-spawning. Don't really understand how that's going to work though.

Jawed
 
D3D11 has dynamic shader linking - sounds like a nightmare for the compiler writers to me and I can't help thinking it makes a right old mess of register allocation/thread-spawning. Don't really understand how that's going to work though.

Why should that be a problem? I am no compiler guru, but a compiler would typically convert each module into an intermediate representation and then build everything into one program object. If I understand this correctly, OpenGL has had this since 2.0.
 
Why should that be a problem? I am no compiler guru, but a compiler would typically convert each module into an intermediate representation and then build everything into one program object. If I understand this correctly, OpenGL has had this since 2.0.
DSL is an alternative to the use of ubershaders. I am a total OpenGL noob, so I can't make the comparison.

http://www.gamedev.net/community/forums/topic.asp?topic_id=530547

http://markmail.org/message/mbqkmsxyxkxn6z3w

http://www.nvidia.com/content/nvisi...oper_Track/NVISION08-Direct3D_11_Overview.pdf

What I think could be problematic is the way that each module is compiled in ignorant bliss of the others. How do you re-use registers across the modules if they're all statically compiled already? Against this is the argument that an uber-shader always has the worst-case register allocation. So in theory the DSL should use fewer registers - but I'm not convinced the combinatorial explosion induced by the modules' private register allocations is going to produce a happily co-existing population of registers.

Jawed
 
What I think could be problematic is the way that each module is compiled in ignorant bliss of the others. How do you re-use registers across the modules if they're all statically compiled already? Against this is the argument that an uber-shader always has the worst-case register allocation. So in theory the DSL should use fewer registers - but I'm not convinced the combinatorial explosion induced by the modules' private register allocations is going to produce a happily co-existing population of registers.

Well, that's the beauty of LTO. Each module is compiled only down to the compiler's IR, then all modules are linked together to make the final module, and then you optimize.

Caution: This is only one approach to doing link time optimization.

http://en.wikipedia.org/wiki/Link-time_optimization

LLVM, the toolset used in many JIT engines, says here
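Separately from whatever LLVM actually does, the compile-to-IR-then-link-then-optimize flow can be caricatured like this (a toy of my own, nothing vendor-specific - the "IR" is just a dict and the only optimization is constant folding):

```python
# Toy link-time optimization: each "module" is lowered only to a simple
# IR, the linker merges the IRs, and a cross-module constant-folding
# pass then runs over the whole program.

def compile_to_ir(defs):
    # "Compile" a module down to IR: name -> constant or (op, lhs, rhs).
    return dict(defs)

def link(*irs):
    program = {}
    for ir in irs:
        program.update(ir)
    return program

def optimize(program):
    # Cross-module constant folding: only possible after linking, since
    # one module alone cannot see another module's constants.
    def fold(name):
        val = program[name]
        if isinstance(val, tuple):
            op, a, b = val
            val = fold(a) * fold(b) if op == "mul" else fold(a) + fold(b)
            program[name] = val
        return val
    for name in list(program):
        fold(name)
    return program

module_a = compile_to_ir({"WIDTH": 4})
module_b = compile_to_ir({"AREA": ("mul", "WIDTH", "WIDTH")})
prog = optimize(link(module_a, module_b))   # AREA folds to a constant
```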
 
GPUs are different from CPUs though. GPUs (excluding Larrabee) have to allocate the register file in a trade-off with latency-hiding. And, theoretically, in a trade-off with other shader types that can be executing on the clusters at the same time (e.g. vertex shaders, geometry shaders etc.).
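To make that trade-off concrete with made-up round numbers (the real register-file sizes and batch widths are whatever ATI/NVidia actually ship, not these):

```python
# The register-file / latency-hiding trade-off, with hypothetical
# round numbers: a fixed register file is divided among the threads in
# flight, so the more registers each shader thread needs, the fewer
# threads the cluster can keep running to cover memory latency.

REG_FILE_SLOTS = 16384          # hypothetical register slots per cluster
THREADS_PER_WAVEFRONT = 64      # hypothetical SIMD batch width

def wavefronts_in_flight(regs_per_thread):
    threads = REG_FILE_SLOTS // regs_per_thread
    return threads // THREADS_PER_WAVEFRONT

# A shader needing 8 registers keeps 4x as many wavefronts in flight
# as one needing 32 -- and hence hides 4x as much latency.
light = wavefronts_in_flight(8)
heavy = wavefronts_in_flight(32)
```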

CPUs don't (in these modern times, generally) have a static set of variables that can fit entirely within registers for the duration of a program.

To be honest I've seen practically nothing on the co-habitation of registers for multiple shader types in GPU clusters. I've seen some discussion on Xenos, but that's all I can remember. It seems to me a lethally complex subject.

Now, if all the GPUs switch to "small register files" like Larrabee, where the registers are nothing more than scratch-pad for the ALUs, then that makes things quite different. That'd be a huge change for ATI and NVidia.

Jawed
 
Yes, GPUs trade off latency hiding vs. register use. On NV GPUs at least, the compilers flatten all calls into straight-line code and then perform optimizations, so that would reduce register pressure automatically, wouldn't it?
 
Yes. ATI is the same.

The NVidia presentation I linked earlier makes it seem like linkage is done before driver compilation, so maybe I'm just getting excited over nothing.

Jawed
 
That would put RV870 tape-out a couple of months behind us. But that doesn't rhyme with the RV740/40nm "scarcity" on the desktop. That is, unless RV870 isn't 40nm and we're seeing ATI's "G80" in the making here.
 
Assuming the TeraScale architecture is still the architecture of choice for RV870, I would like to see a higher die-space budget in RV870 (even factoring in 40nm). I think they were too conservative with RV770 despite how efficient it is per mm2; with an extra 100mm2 of real estate it could have laid a whopping on GT200 while still being smaller. Hopefully close to (but under) 400mm2.
 
Assuming the TeraScale architecture is still the architecture of choice for RV870, I would like to see a higher die-space budget in RV870 (even factoring in 40nm). I think they were too conservative with RV770 despite how efficient it is per mm2; with an extra 100mm2 of real estate it could have laid a whopping on GT200 while still being smaller. Hopefully close to (but under) 400mm2.
AMD's problem seems to be power, if they still want to build an X2 card. So if power goes up by 30%+ per GPU, X2 is looking dicey.

If AMD's sticking with 16 RBEs with 4xZ per clock there's also a fundamental question of just how much faster it could go. Let's say it launches at 1GHz, that's <18% faster than HD4890.
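The "<18%" figure falls straight out of the clocks (HD 4890 ships at 850MHz core):

```python
# Clock headroom of a hypothetical 1GHz RV870 over the 850MHz HD 4890.
# The 1GHz launch clock is the speculation above, not a known spec.
hd4890_mhz = 850
assumed_rv870_mhz = 1000
speedup = assumed_rv870_mhz / hd4890_mhz - 1   # ~0.176, i.e. under 18%
```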

Why do I get the feeling that RV870, in comparison with GT300, is going to be short of Z-rate, like R600 was at launch?

Jawed
 
They manage to sell the RV770 on cards as low as 80 euros (the 4830 at the lowest price point), which is pretty mind-boggling (and still a 256-bit part). That chip made $500 graphics cards obsolete on its own (dual-GPU excluded), pushing down the last remaining costly part of a gaming PC.

Now the price pressure has been such that even GT200 has become affordable, with a GTX260/216 at 170 euros. It's pretty insane.
In that light, even if AMD could have made an even faster chip, they made a great decision.
RV740 is another insane GPU, and so will be RV870. (GT215 and GT300 probably will be, too.)
 
AMD's problem seems to be power, if they still want to build an X2 card. So if power goes up by 30%+ per GPU, X2 is looking dicey.

If AMD's sticking with 16 RBEs with 4xZ per clock there's also a fundamental question of just how much faster it could go. Let's say it launches at 1GHz, that's <18% faster than HD4890.

Why do I get the feeling that RV870, in comparison with GT300, is going to be short of Z-rate, like R600 was at launch?

Jawed

I actually have the feeling it's finally going to have something better than 16 pixels per clock.
If they have to ditch their X2, then so be it. Motherboards with multiple PCIe slots are the norm these days, and NV's chipsets are the unpopular ones on both platforms. I don't think they'd be missing much by sticking with plain CrossFire.
 
I forgot that - ironic really after getting all excited about what RV740 might be. Hmm, yeah, that would be groovy.

Jawed
 
I forgot that - ironic really after getting all excited about what RV740 might be. Hmm, yeah, that would be groovy.

Jawed

I'm going to have to find those estimates you made on die size/percentages for the different types of units in RV770.

My rough speculation was a 1600SP (32sp x 10c) part w/ 32 ROPs, 80 TMUs, a 256-bit bus w/ 6.3GHz GDDR5 (~200GB/s), clocked at at least 650MHz to make 2TFLOPs. Fitting into a die size around RV770's, ~250mm2 w/ ~1.3B trannies.

EDIT- Most recent one I found was this one...

Only 40% of RV770 is clusters, about 390M transistors. Of that, I estimate 64% is ALUs, excluding the redundant ALU lanes, which I like to lump in with the TU when looking at a die shot and which makes scaling estimates a bit simpler.

So RV770 has about 250M transistors for its 800 ALU lanes and 140M for the rest. RV740's 640 lanes would be ~200M transistors, subject to new functionality and lack of double precision. The clusters, as a whole, would be about 312M, leaving, as you observe, a hell of a lot of transistors, 514M, for MCs, RBEs, the hub, PCI Express etc.

This compares with ~566M in RV770 :oops:

Analogue stuff isn't supposed to shrink particularly well - dunno how to account for that.

Jawed

but I thought I remembered one of your posts breaking down the different units even more...
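The arithmetic in the quoted estimate checks out if you re-run it with its own numbers, taking the published totals of ~956M (RV770) and ~826M (RV740) transistors:

```python
# Re-running the transistor estimate quoted above with its own figures.
clusters_m = 390                  # ~40% of RV770's transistors, per the post
rv770_alus_m = 250                # ~64% of the clusters, for 800 ALU lanes
rest_m = clusters_m - rv770_alus_m            # 140M for TUs etc.

rv740_alus_m = rv770_alus_m * 640 // 800      # 200M for RV740's 640 lanes
rv740_clusters_m = rv740_alus_m + rest_m * 640 // 800   # 312M total clusters

rv770_uncore_m = 956 - clusters_m             # ~566M for MCs, RBEs, hub...
rv740_uncore_m = 826 - rv740_clusters_m       # ~514M, as observed
```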
 