AMD: R7xx Speculation

Status
Not open for further replies.
Well, you still have to have transistors for routing data in the larger processes. I remember the G70 to G71 shrink; that was one place where nV shaved off some transistors. I don't know how many, though, but it could be significant.

That was a mere 24 million transistors (and in today's GPUs that means a lot less), coming from a full node step down (110nm ---> 90nm), mainly from shortening the pipeline thanks to the higher clocks 90nm allows (faster switching). So I hardly think a half step (65nm ---> 55nm) will yield the same, regardless of whether what happened with G71 is even applicable to G94 to begin with.
Hmm, looking at the figures now, I am willing to bet the control logic in the G8x and G9x chips is significantly more than in the AMD counterparts?

And control logic is somehow not counted in the transistor budget? Which brings me right back to the question I asked Arun: is there some sort of difference between the actual transistor sizes themselves (theoretically speaking, if G94 were on 55nm, though the comparison does not have to be just RV670 and G94), or does chip layout play a significant role here?

Of course, more TMUs and ROPs, and more robust TMUs as well.

Twice the filtering and ?address units?, less samplers, the same number of ROPs (or pixel capability, that is). I'm not exactly sure what you mean by more "robust" TMUs either, or how that even fits in with the topic at all.
 
Twice the filtering and ?address units?, less samplers, the same number of ROPs (or pixel capability, that is). I'm not exactly sure what you mean by more "robust" TMUs either, or how that even fits in with the topic at all.

Less samplers?
 
Well, you still have to have transistors for routing data in the larger processes. I remember the G70 to G71 shrink; that was one place where nV shaved off some transistors. I don't know how many, though, but it could be significant.
It really depends. On smaller processes you probably spend more transistors on repeaters for wires.


That was a mere 24 million transistors (and in today's GPUs that means a lot less), coming from a full node step down (110nm ---> 90nm), mainly from shortening the pipeline thanks to the higher clocks 90nm allows (faster switching). So I hardly think a half step (65nm ---> 55nm) will yield the same, regardless of whether what happened with G71 is even applicable to G94 to begin with.
Why would it be less in today's GPUs? GPU clocks and functionality have been growing even with faster transistor switching speeds. Also the speed improvement from scaling is slowing down these days. In the G70->G71 case it could just be that they were very conservative with the G70 and less so with the G71. Or they could have engineered faster math circuits. Or some combination of the above. It's not just a function of the manufacturing process.


And control logic is somehow not counted in the transistor budget? Which brings me right back to the question I asked Arun: is there some sort of difference between the actual transistor sizes themselves (theoretically speaking, if G94 were on 55nm, though the comparison does not have to be just RV670 and G94), or does chip layout play a significant role here?
The transistor density could certainly be different for different designs, and it could also be a choice for yield optimization. The smallest/densest layout may not have the best yields.
 
Less samplers?

FP32 texture sampling units. I'm not sure whether they are coupled with the address units on G80 (in which case there would be no more than 64) or whether they even exist at all on G80! But in any case, R600/RV670 have 20 FP32 texture samplers per texture block, for a total of 80.

It really depends. On smaller processes you probably spend more transistors on repeaters for wires.

Why would it be less in today's GPUs?

I was talking about the diminishing importance of 24 million transistors on a 500m chip as compared to a 304m chip and never mind the node differences!
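
To put rough numbers on that, here is a minimal Python sketch using only the figures quoted in this thread (24 million transistors saved, a 304m chip and a 500m chip). The function name is made up for illustration, and node differences are deliberately ignored, as above.

```python
def share_of_budget(saved_millions, total_millions):
    """Fraction of the whole transistor budget that the saving represents."""
    return saved_millions / total_millions


# Figures as quoted in the thread, in millions of transistors.
on_304m_chip = share_of_budget(24, 304)
on_500m_chip = share_of_budget(24, 500)

# The same 24 million is a noticeably smaller slice of the bigger chip.
print(f"304m chip: {on_304m_chip:.1%}, 500m chip: {on_500m_chip:.1%}")
```

So the same absolute saving is roughly 8% of the smaller chip but under 5% of the larger one.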

Also the speed improvement from scaling is slowing down these days. In the G70->G71 case it could just be that they were very conservative with the G70 and less so with the G71. Or they could have engineered faster math circuits. Or some combination of the above. It's not just a function of the manufacturing process.

The transistor density could certainly be different for different designs, and it could also be a choice for yield optimization. The smallest/densest layout may not have the best yields.

Thanks, that seems like a reasonable perspective and makes plenty of sense.
 
FP32 texture sampling units. I'm not sure whether they are coupled with the address units on G80 (in which case there would be no more than 64) or whether they even exist at all on G80! But in any case, R600/RV670 have 20 FP32 texture samplers per texture block, for a total of 80.

Those 80 samplers correspond to 80 texels per clock retrieved giving a total of 16 bilerps and 16 point samples. Now consider how many texels G80 retrieves per clock in order to produce 64 bilerps. So in terms of "samplers" G80 has far more than R600. Granted each sampling unit on R600 is a bit beefier as it does full speed FP16 but G80 more than makes up for that by having four times as many.
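
The texel arithmetic can be sketched in a few lines of Python. This assumes a bilinear sample reads a 2x2 footprint (4 texels) and a point sample reads 1 texel; the G80 figure is only what 64 bilerps would imply, with no point samplers counted since none are mentioned here. The helper name is hypothetical.

```python
TEXELS_PER_BILERP = 4  # 2x2 footprint for a bilinear sample
TEXELS_PER_POINT = 1   # a point sample reads a single texel


def texels_per_clock(bilerps, point_samples=0):
    """Texels fetched per clock for a given mix of bilinear and point samples."""
    return bilerps * TEXELS_PER_BILERP + point_samples * TEXELS_PER_POINT


r600_texels = texels_per_clock(bilerps=16, point_samples=16)
g80_texels = texels_per_clock(bilerps=64)  # assumption: bilerps only

print(r600_texels, g80_texels, g80_texels / r600_texels)
```

Counted in bilerps the gap is 64 vs 16, i.e. four times; counted in raw texels under these assumptions it is 256 vs 80.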
 
Those 80 samplers correspond to 80 texels per clock retrieved giving a total of 16 bilerps and 16 point samples. Now consider how many texels G80 retrieves per clock in order to produce 64 bilerps. So in terms of "samplers" G80 has far more than R600. Granted each sampling unit on R600 is a bit beefier as it does full speed FP16 but G80 more than makes up for that by having four times as many.

:???: Thanks.. I was strangely confused.
 
I was talking about the diminishing importance of 24 million transistors on a 500m chip as compared to a 304m chip and never mind the node differences!
Ah, sorry, I misunderstood that.
If you look at it in terms of reducing pipeline stages, however, a single pipeline stage in that 500m GPU would probably have more transistors in it than one in the 300m chip.
 
Finalized RV770 specs?

[Images: 1.gif, 2.gif, 3.gif]


Source: VR-Zone

I'm liking the clocks and like everyone else said, these seem a bit more realistic than the 800SP rumor.
 
I did a possible configuration based on this rumour back in February:

[Image: b3da007.gif]

In the past I've described it as 12 SIMDs. I dislike this idea because that's a lot of control overhead and results in relatively coarse-grained redundancy (60 redundant ALU lanes as compared with 20 in RV670). Alternatively, I suppose, it's possible to implement it as 4 SIMDs - each set of 96 SPs sharing a program counter. That would have 20 redundant ALU lanes - but now the issue is the batch size of 96...

This arrangement is the same type as seen inside R580, where each SIMD is 3 quad ALUs (12 pipes) sharing a single TMU.

So, as a 4 SIMD design I'm not unhappy. Still a bit dubious about it being a 3:1 ALU:TEX ratio, though.
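
The redundancy figures above can be reproduced with a small Python sketch. It assumes one spare 5-lane (4D+1D) unit per SIMD, which is my reading of how the 60 vs 20 numbers come about and not a confirmed hardware detail; the names are hypothetical.

```python
LANES_PER_UNIT = 5  # one 4D+1D ALU unit


def redundant_lanes(simd_count, spare_units_per_simd=1):
    """ALU lanes set aside for redundancy, assuming spares are kept per SIMD."""
    return simd_count * spare_units_per_simd * LANES_PER_UNIT


twelve_simds = redundant_lanes(12)  # the 12-SIMD reading of the rumour
four_simds = redundant_lanes(4)     # the 4-SIMD reading, same count as RV670

print(twelve_simds, four_simds)
```

Under that assumption the cost of redundancy scales with the number of SIMDs, which is why the 4 SIMD layout looks cheaper.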

Jawed
 
Today's rumour from Chiphell says:

RV770 final specifications
480SP (RV670 320)
Framework used R600, 4D +1 D and D for every 96 (RV670 every 64 D)
32TMU (RV670 than doubled)
Frequency 800 ~ 900MHz, depending on the final outcome of TSMC volume production scheduled listing price (RV670 reference listed prices)
Finally tell you that the version of RV770-how do not think it is RV670 twins, the future price trend can also RV670 reference to the current series.

4D+1D looks like a Xenos core design. But other rumours reject all speculative RV770 specs so far.
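
Taking the rumoured figures at face value, the ALU:TEX ratio mentioned earlier in the thread can be checked with a short Python sketch. It assumes each ALU unit is 5 SPs wide (4D+1D) and that "doubled" means RV670 has 16 TMUs; both are readings of the rumour, not confirmed specs, and the function name is made up.

```python
from fractions import Fraction


def alu_tex_ratio(stream_processors, tmus, lanes_per_unit=5):
    """ALU units per TMU, counting one 4D+1D group as a single ALU unit."""
    alu_units = stream_processors // lanes_per_unit
    return Fraction(alu_units, tmus)


rv770_rumour = alu_tex_ratio(480, 32)  # rumoured: 480 SPs, 32 TMUs
rv670 = alu_tex_ratio(320, 16)         # 320 SPs, TMU count assumed to be half of 32

print(f"RV770 (rumour) {rv770_rumour}:1, RV670 {rv670}:1")
```

If the rumour holds, texturing would grow faster than ALU throughput relative to RV670.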
 
Still a bit dubious about it being a 3:1 ALU:TEX ratio, though.
Well, I certainly won't complain for the doubled bilerp rate. ;)
The million dol... euro question here is how the batch preprocessing is done at the top level. Or maybe there will be a two-level "distributed" design. ATi really loves round square based structures, here. ;)
 
Framework used R600, 4D +1 D and D for every 96 (RV670 every 64 D)
I interpret that last "D" as a reference to the sequencer, i.e. a sequencer controls groups of 96 SPs, while in RV670 a sequencer controls groups of 64 SPs.

4D+1D looks like a Xenos core design.
The 1D in Xenos is a transcendental unit, unable to do MAD, giving 2 instructions per clock. In R600 and all later GPUs it's 5D MAD with one lane also doing transcendentals (and extra integer instructions), making up to 5 different instructions per clock.

Jawed
 
The million dol... euro question here is how the batch preprocessing is done at the top level. Or maybe there will be a two-level "distributed" design. ATi really loves round square based structures, here. ;)
What do you mean by batch preprocessing?

In R600 there's an interesting hierarchy of processors:

Code:
         Sequencer
         |   |   |
    ------   |   -------
    |        |         |
   ALU    Vertex    Texture

A shader program consists of Sequencer instructions, with some Sequencer instructions being calls to subroutines of type ALU, Vertex or Texture. So you can think of the "shader" as being a network of four types of programmable processor.
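
As a toy illustration of that hierarchy, here is a minimal Python sketch in which a sequencer walks a program and hands each subroutine to the matching processor type. The class, function and instruction names are all hypothetical; real hardware interleaves many threads and is far more involved than this.

```python
from dataclasses import dataclass
from typing import List, Tuple

PROCESSOR_TYPES = ("ALU", "Vertex", "Texture")


@dataclass
class Clause:
    """A subroutine the sequencer calls on one type of processor."""
    kind: str                # one of PROCESSOR_TYPES
    instructions: List[str]  # placeholder instruction names


def run_sequencer(program: List[Clause]) -> List[Tuple[str, str]]:
    """Walk the sequencer program in order, recording which processor runs what."""
    trace = []
    for clause in program:
        if clause.kind not in PROCESSOR_TYPES:
            raise ValueError(f"unknown processor type: {clause.kind}")
        for instruction in clause.instructions:
            trace.append((clause.kind, instruction))
    return trace


# A made-up shader: fetch a vertex, do some maths, sample a texture, do more maths.
shader = [
    Clause("Vertex", ["fetch_position"]),
    Clause("ALU", ["mul", "add"]),
    Clause("Texture", ["sample"]),
    Clause("ALU", ["mad"]),
]

for processor, instruction in run_sequencer(shader):
    print(processor, instruction)
```

The point is only that control flow lives in the sequencer program, while the ALU, Vertex and Texture work sits in subroutines it calls.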

Jawed
 