That's interesting - so there's more to it than the use of a forwarding network, and those could be real registers after all. I stand corrected. BTW, as a compiler writer I'd love to see the algorithm they are using in the shader compiler for register allocation. Modeling those 'registers' in conventional algorithms is probably not possible, and to use them effectively you probably also need to tweak re-materialization and common sub-expression elimination. [snip]
It's worth noting that an operand need not be consumed on the next "cycle". It persists as long as the lane that produced it is "masked out" (or issued a NOP, erm, not sure now) on later cycles.
It's also worth remembering that the pipeline length of the ALUs is 8, i.e. an operand stored in-pipeline actually persists for multiples of 8 cycles.
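To put that in code form, here's a toy model of the in-pipeline operand idea. The 8-stage depth comes from the above; the recirculate-while-masked rule is just my reading of it, and live_result is a made-up helper, so treat this as a sketch rather than how the hardware actually works.

```python
PIPE_DEPTH = 8  # ALU pipeline length, per the post

def live_result(issues, lane, query_cycle):
    """Most recent unmasked result for `lane` still visible at `query_cycle`.

    `issues` is a list of (cycle, lane, masked, value) tuples in issue order.
    Returns None if nothing for that lane is at the pipeline output right now.
    """
    latest = None
    for cycle, ln, masked, value in issues:
        if ln != lane or cycle > query_cycle:
            continue
        if not masked:
            latest = (cycle, value)  # a new unmasked issue overwrites the old result
    if latest is None:
        return None
    start, value = latest
    # The stored operand only reappears at the pipeline output every
    # PIPE_DEPTH cycles - hence "multiples of 8 cycles".
    return value if (query_cycle - start) % PIPE_DEPTH == 0 else None

# Lane 5 computes a*b at cycle 0, then is masked out on the next two passes:
issues = [(0, 5, False, "a*b"), (8, 5, True, None), (16, 5, True, None)]
print(live_result(issues, 5, 16))  # -> "a*b", still available 16 cycles later
```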
D3D11 has dynamic shader linking - sounds like a nightmare for the compiler writers to me, and I can't help thinking it makes a right old mess of register allocation/thread-spawning. I don't really understand how that's going to work, though.
DSL is an alternative to the use of ubershaders. I am a total OpenGL noob, so I can't make the comparison.
Why should that be a problem? I am no compiler guru, but a compiler would typically convert each module into an intermediate representation and then build everything into one program object. If I understand this correctly, OpenGL has had this since 2.0.
What I think could be problematic is the way that each module is compiled in blissful ignorance of the others. How do you re-use registers across the modules if they're all statically compiled already? Against this is the argument that an ubershader always gets the worst-case register allocation, so in theory the DSL should use fewer registers - but I'm not convinced the combinatorial explosion induced by the modules' "private" register allocations is going to produce a happily co-existing population of registers.
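To put some made-up numbers on that worry (module names, register counts and the liveness figure below are all invented, this is just a sketch): if private allocations can only be stacked at link time, never overlapped, the linked total can come out worse than a jointly-allocated ubershader.

```python
# Registers each module needs when compiled on its own (invented figures).
light = {"point": 4, "spot": 3}
surface = {"phong": 5, "toon": 2}

# Ubershader: every combination compiled as one program. It budgets for the
# worst case, but joint liveness analysis lets register lifetimes overlap.
live_across_call = 2  # assumed: light registers still live in surface code
ubershader = max(max(light.values()),
                 live_across_call + max(surface.values()))  # -> 7

# Dynamic linking: each module keeps its private, pre-compiled allocation,
# so the blocks simply stack:
linked = max(light.values()) + max(surface.values())        # -> 9

print(ubershader, linked)  # 7 vs 9: the private populations co-exist badly
```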
That doesn't rhyme with RV740/40nm "scarcity" on the desktop.
AMD's problem seems to be power, if they still want to build an X2 card. So if power goes up by 30%+ per GPU, X2 is looking dicey.
Assuming the TeraScale architecture is still the architecture of choice for RV870, I would like to see a higher die-space budget for RV870 (even factoring in 40nm). I think they were too conservative with RV770 despite how efficient it is per mm2: with an extra 100mm2 of real estate I think it could have laid a whupping on GT200 while still being smaller (RV770 is ~256mm2, so ~356mm2 against GT200's ~576mm2). Hopefully under 400mm2.
AMD's problem seems to be power, if they still want to build an X2 card. So if power goes up by 30%+ per GPU, X2 is looking dicey.
If AMD's sticking with 16 RBEs at 4xZ per clock, there's also a fundamental question of just how much faster it could go. Let's say it launches at 1GHz: against HD4890's 850MHz that's <18% faster (1000/850 ≈ 1.18).
Why do I get the feeling that RV870, in comparison with GT300, is going to be short of Z-rate, like R600 was at launch?
Jawed
Why not use a similar ROPs/bus ratio as for RV740?
Why do I get the feeling that RV870, in comparison with GT300, is going to be short of Z-rate, like R600 was at launch?
Jawed
I forgot that - ironic really after getting all excited about what RV740 might be. Hmm, yeah, that would be groovy.
Jawed
Only 40% of RV770 is clusters, about 390M transistors. Of that, I estimate 64% is ALUs, excluding the redundant ALU lanes, which I like to lump in with the TUs when looking at a die shot; that makes scaling estimates a bit simpler.
So RV770 has about 250M transistors for its 800 ALU lanes and 140M for the rest of the clusters. RV740's 640 lanes would be ~200M transistors, subject to new functionality and the lack of double precision. The clusters as a whole would be about 312M, leaving, as you observe, a hell of a lot of transistors - 514M - for MCs, RBEs, the hub, PCI Express etc.
This compares with ~566M in RV770.
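For anyone who wants to poke at the numbers, here's the arithmetic laid out as a sketch. Only the two die totals (956M for RV770, 826M for RV740) are published figures; the 40%/64% splits are my estimates from the die shot.

```python
# Back-of-envelope scaling of RV770's transistor budget down to RV740.
rv770_total = 956e6                               # published RV770 count
rv770_clusters = 390e6                            # ~40% of the die is clusters
rv770_alus = 0.64 * rv770_clusters                # ~250M for the 800 ALU lanes
rv770_cluster_rest = rv770_clusters - rv770_alus  # ~140M for TUs etc.

rv740_clusters = rv770_clusters * 640 / 800       # ~312M for 640 lanes
rv740_total = 826e6                               # published RV740 count
rv740_uncore = rv740_total - rv740_clusters       # MCs, RBEs, hub, PCI Express

rv770_uncore = rv770_total - rv770_clusters       # same budget in RV770
print(rv740_uncore / 1e6, rv770_uncore / 1e6)     # -> 514.0 566.0
```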
Analogue stuff isn't supposed to shrink particularly well - dunno how to account for that.
Jawed