AMD: R7xx Speculation

Status
Not open for further replies.
CPUs have multiple subunits, yet they can still be called dual core. What I mean is that I've heard that it's the same architecture doubled up instead of just being one large chip. (For flexibility.)

Actually, about fab capacity, I heard that they had a new fab up near Canada, in New York state. I heard that this fab could be used to make GPUs.

Wouldn't it be expected that they would start making them now on their fabs, as a prelude to making the fusion chip with an r770 core?

:)

Where do you hear these things? Oh boy. The bolded part is silly. The fab capacity part is untrue: you don't just hear about a new fab showing up; these are multi-billion dollar investments. AMD lacks fab capacity ATM and also lacks a process that would be adequate for ATI's needs, AFAIK.

The RV770 is done at TSMC. There are a number of good reasons for this.
 
Sorry, I assumed that a two-part CPU had to be fabbed at the same time in the same plant. My bad.
 
From the sounds of what that is based on... it's not speculation.

Since it supposedly came from a leaked presentation, it's either faked or real. Putting your speculations into a fake AMD presentation does not seem like speculation to me. If the presentation is authentic then we have some actual information.

I love these types of threads. :D
 
Wow, yet more shader power in there -- how is this already, 96 VLIW units?
It would seem reasonable to presume that the ALU:TEX ratio in RV770 is the same as in RV670, 4:1.

So, 24 TUs at 1050MHz = 25,200M texels/s bilinear, versus 12,432M for RV670, a 103% increase. So that explains the texture rate without an actual doubling in the count of TUs, merely 50% more units.
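
For anyone who wants to check that arithmetic, here is a minimal Python sketch. It assumes one bilinear texel per TU per clock, and it takes RV670 as 16 TUs at 777MHz, which is what the 12,432 figure implies. The function name is purely illustrative.

```python
def bilinear_texel_rate(texture_units: int, clock_mhz: int) -> int:
    """Peak bilinear rate in Mtexels/s, assuming 1 texel per TU per clock."""
    return texture_units * clock_mhz

# RV670 figures are inferred from the 12,432M number quoted in the post.
rv670 = bilinear_texel_rate(16, 777)
# Rumoured RV770 clock with the 24-TU guess from the post.
rv770 = bilinear_texel_rate(24, 1050)

# Percentage increase of the rumoured part over RV670.
increase_pct = (rv770 / rv670 - 1) * 100

print(f"RV670: {rv670:,}M texels/s")
print(f"RV770: {rv770:,}M texels/s")
print(f"Increase: {increase_pct:.0f}%")
```

The increase is just the product of the two factors: 1.5x the units times roughly 1.35x the clock.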

Jawed
 
Wow, yet more shader power in there -- how is this already, 96 VLIW units?

No, it's 6 SIMDs. ;)
The question is, what do they mean by "32 TMUs".
The RV770 has 32 sampler units? Nice idea, but it costs a lot of transistors.
Or:
The RV770 can filter 32 pixels per clock. I think this sounds better.
 
Wow, yet more shader power in there -- how is this already, 96 VLIW units?

[strike]Possible[/strike], the question is how they will get the 32 TFs: one texture-SIMD with 4+4 quads, two texture-SIMDs with 4 quads each, or Octa-TMUs?

Maybe this information is more reliable than I earlier thought, but the biggest problem is the high clocks combined with 55nm, ~1B transistors and such relatively low TDPs. :???:
 
Interesting. Absolutely pathetic if true, but interesting nonetheless. How long is ATi going to drag on this 16 ROP B.S.? The doubling of TMUs was needed at least a generation ago, and the 50% increase in SPs is just meh. Honestly, this sounds like what RV670 (and even R600) should've been.
 
Wouldn't that imply a batch size and branching granularity of 96?

It's been a while since I've read the threads discussing this for R600, so I can't recall the math involved.
 
Wouldn't that imply a batch size and branching granularity of 96?
Yes.

That's actually a pretty good argument against the configuration I've specified, because the ATI guys gave the impression that 64 is likely to be the batch size they stick with for a while. On the other hand, as overall ALU performance increases, the marginal gain in dynamic branching performance with smaller batches diminishes: it becomes more and more costly to minimise batch size.

(I'm expecting NVidia's next GPU design to go to a batch size of 32, whereas G80 etc. are actually at 16. But it's looking like NVidia will settle on 32 for at least the mid term, bringing some stability to the way people program it.)

It is possible to come up with a 6 SIMD configuration (which has a batch size of 64). This would then require either 16 or 32 TUs (each quad in the SIMD has access to either 1 or 2 TU quads).

16 TUs would make for an ALU:TEX ratio of 6:1, while 32 TUs would make for a 3:1 ratio.

So, ahem, it comes down to what you think is most likely:
  • 4:1 - 96:24
  • 6:1 - 96:16
  • 3:1 - 96:32
The number of RBEs, 16, implies 4 SIMDs, i.e. that each SIMD is 24 wide, and that there are 24 TUs. Unless I've missed something...
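
The three candidate configurations can be tabulated with a short Python sketch. It assumes, as on R600, that a SIMD issues each instruction over 4 clocks, so batch size is SIMD width times 4. That matches the 64 and 96 figures in this thread but is not confirmed for RV770, and the names used here are hypothetical.

```python
ALUS = 96             # VLIW units rumoured for RV770
CLOCKS_PER_INSTR = 4  # assumption carried over from R600

def config(simds: int, tus: int) -> dict:
    """Derive SIMD width, batch size and ALU:TEX ratio for one layout."""
    width = ALUS // simds
    return {
        "simd_width": width,
        "batch_size": width * CLOCKS_PER_INSTR,
        "alu_tex": f"{ALUS // tus}:1",
    }

# (SIMD count, TU count) pairs discussed in the post.
for simds, tus in [(4, 24), (6, 16), (6, 32)]:
    print(f"{simds} SIMDs, {tus} TUs -> {config(simds, tus)}")
```

Only the 6-SIMD layouts keep the batch size at 64; the 4-SIMD layout that lines up with 16 RBEs pushes it to 96.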

Jawed
 
Four SIMD groups? I was hoping for a wider MIMD config with the same batch size as now. ;)
Of course, that would imply still 16 bilerps, if we keep the horizontal structural alignment.
 
Sounds possible, but they say the RV770 has 32 TMUs. With 24-wide SIMDs it can't work.
I put the strongest likelihood on the RBE count - 16 is a number that doesn't need to increase (colour fillrates don't need to increase so much). But the MSAA-sample/zixel rate does need to increase, so I have my fingers crossed that there's 4x Z per clock, not RV670's 2x.

With 16 RBEs (4 quads) it seems unavoidable to me that there are 4 SIMDs.

As I've shown already, the texturing performance doubles because of the 50% increase in units and the 35% increase in clocks.

But hey, I've been bitten a few times by ATI's SIMD v TU configurations...

Jawed
 
Four SIMD groups? I was hoping for a wider MIMD config with the same batch size as now. ;)
Of course, that would imply still 16 bilerps, if we keep the horizontal structural alignment.
Well, to be fair, you can also have 32 TUs in total - which would produce a fairly startling (for ATI) 33,600M texels/s.

Then you just need to share six 16-wide SIMDs with 4 quad-RBE units. That's not quite the same as 3 quad RBEs sharing a 256-bit bus (something we see in RV570), but it's similar...

I'd like to say there's no chance that there's only 16 TUs in RV770 - but I had that same gut feeling about R600 way back, because I knew what that implied. Sigh.

Jawed
 
The other minor puzzlement here is why ATi is increasing the raw ALU capacity from the already abundant one in the R6 series. Are they planning to export even more fixed-function ops through the shader array (things from the TEX domain, for instance)?
 
The other minor puzzlement here is why ATi is increasing the raw ALU capacity from the already abundant one in the R6 series. Are they planning to export even more fixed-function ops through the shader array (the TEX domain)?
Perhaps they'd do pixel blending (RBE) stuff there. Blending isn't hugely different from AA resolve (but there are different methods of blending so the overall complexity of the "shader" is slightly higher).

So the RBEs would end up as compress/decompress colour/Z/stencil with Z/stencil testing and MSAA encoding/decoding (essentially more compression/decompression).

I have to admit, I do wonder whether Z/stencil testing would make the move into becoming a shader program at the same time. It's a set of operations that doesn't, as far as I can tell, suffer a performance drop depending on the batch size, i.e. dynamic branching granularity is not going to negatively affect performance here (after all, we're seemingly looking at a GPU that can perform 1 instruction on 96 pixels in parallel). So, is the actual throughput possible? I think it could be...

If Z/stencil testing also becomes a shader program then the RBEs are just doing compression/decompression, transmitting pixels+zixels back and forth to the shader pipe and providing data to the hierarchical Z/stencil unit.

Bit of a long shot though. I wouldn't be surprised if R7xx does nothing extra here, its major change being nothing more than being optimised properly for "multi-chip".

Jawed
 