AMD: R7xx Speculation

Status
Not open for further replies.
If games ever truly use GPU resources for physics acceleration, I'll welcome that day wholeheartedly.

Especially since I wouldn't have to fork over any money for a dedicated card. Just wait until I upgrade my video card, and voila, instead of having an expensive paperweight (if no one I know needs my old GPU) I have a used physics acceleration card. :D

Will that ever happen? Who knows, but one can always dream.

It'd be the next best thing to sliced bread, IMO. :)

Regards,
SB
 
Let's remember the supposed doubling of TMUs, the 50% increase in ALUs and the hopefully tweaked ROPs. These improvements should fill up some of that extra bandwidth.
I concur on the ROPs, but I fail to grasp how on earth AMD would fit +50% shader ALUs and double the number of TMUs in just ~33% more die space. TMUs are supposedly pretty large in terms of die space and transistor count, especially those doing single-cycle FP64…
 
That somehow puzzles me too. The fact that RV670 was so small was quite a shock already.

Well, the midrange 55nm RV exhibited about the same scaling, allowing some percentage for the wider bus too.


What really puzzles me, however, is ATI's stubborn refusal to make a G92-sized (at 65nm) chip that scales up AGAIN... (720 SP-ALUs, 64 TMUs; there's something horrendously wrong, or not. :p)

It would essentially repeat G71 versus R580, sides reversed.


Even if the calculation went horrendously wrong, the chip would still end up in the low 400s of mm² at most.
 
I concur on the ROPs, but I fail to grasp how on earth AMD would fit +50% shader ALUs and double the number of TMUs in just ~33% more die space. TMUs are supposedly pretty large in terms of die space and transistor count, especially those doing single-cycle FP64…
G92, in comparison to G94, doubles both TMU and ALU count, and the die-space difference is still lower than ~30%. Does nVidia employ magicians? :smile:
 
That's because die space is taken up by more than just shader ALUs and TMUs...

After all, RV670 removed the 512-bit bus and included UVD but still dropped in transistor count by 34+ million transistors from the R600...
 
G92, in comparison to G94, doubles both TMU and ALU count, and the die-space difference is still lower than ~30%. Does nVidia employ magicians? :smile:
:D Transistor count went up by ~49% - and different kinds of logic on the GPU take up differing amounts of space, e.g. cache and register file sections can be considerably more dense than the rest.



If you compare RV635 and RV670 transistor counts:
  • TUs+L2 cache increased by 100%
  • memory system increased by 100%
  • ALUs increased by 167%
  • RBEs increased by 300%
All for 76% extra transistors. That assumes both GPUs have the same degree of fine-grained redundancy, which can't be true simply because the width of the SIMD units (such as the ALUs) differs between the two.

~27M transistors for 128->256-bit memory, apparently (compare R600 and RV670). So excluding memory system, about 69% extra transistors.
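
The arithmetic behind those two percentages can be sketched as below. The absolute transistor counts (378M for RV635, 666M for RV670) are commonly quoted figures and are an assumption here, since the post only gives the percentages; the 27M bus figure is the estimate from the post.

```python
# Transistor counts in millions. These two are commonly quoted figures and are
# NOT stated in the post, so treat them as assumptions.
RV635_M = 378.0
RV670_M = 666.0
# The post's own estimate for widening the memory system from 128 to 256 bits.
BUS_WIDENING_M = 27.0


def extra_fraction(small: float, large: float) -> float:
    """Return how much bigger `large` is than `small`, as a fraction."""
    return large / small - 1


total_extra = extra_fraction(RV635_M, RV670_M)
extra_without_bus = extra_fraction(RV635_M, RV670_M - BUS_WIDENING_M)

print(f"total: +{total_extra:.0%}, excluding wider bus: +{extra_without_bus:.0%}")
```

With those assumed counts the two results round to the 76% and 69% quoted above.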

Apparently we're looking at up to ~35% extra transistors in RV770 - assuming there's not been a radical shake up using custom logic.

Jawed
 
That's on 65nm too. You would think on 55nm, the increase in die area would be even less.
Doubling the unit count on 65nm requires less than 30% more die space, while doubling the unit count on 55nm doesn't? Less? Or more? I don't get this logic... The manufacturing node is irrelevant if you compare in relative terms.
 
Yep, manufacturing node is irrelevant. But the % increase is certainly higher since the transistor cost of the stuff that doesn't scale gets amortized further with each doubling. On each iteration the scalable units account for a larger part of the transistor budget so the % increase in die area gradually approaches the % increase in those units. That happens even faster if the unit count increases are accompanied by larger caches, buffers etc.

Say you start with a chip that has 60% scalable and 40% fixed based on die area.

Doubling (100% increase) of the scalable stuff is a 60% increase in area and the resulting chip is now 75% scalable and 25% fixed.
Doubling scalable again is now a 75% increase in area and the resulting chip is 86% scalable, 14% fixed.
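
The same amortization argument as a minimal sketch. The 60/40 starting split is the hypothetical one from this post, not a measured figure for any real chip, and the function name is made up for illustration.

```python
def double_scalable(scalable: float, fixed: float, steps: int):
    """Return (area growth, scalable share) for each doubling of the scalable part.

    `scalable` and `fixed` are areas in arbitrary units; only `scalable` doubles.
    """
    results = []
    for _ in range(steps):
        old_area = scalable + fixed
        scalable *= 2
        new_area = scalable + fixed
        results.append((new_area / old_area - 1, scalable / new_area))
    return results


# Hypothetical starting point from the post: 60% scalable, 40% fixed.
for growth, share in double_scalable(60.0, 40.0, 3):
    print(f"area +{growth:.0%}, scalable share now {share:.0%}")
```

Each step's area growth equals the scalable share before that doubling, which is why the growth creeps towards 100% as the fixed part gets amortized.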
 
There are parts of the chip that have *fixed size* and don't scale down with the manufacturing process? I've never heard that before.
 
You're confusing die size reductions based on manufacturing process with the relative increase in die size for adding units. We are discussing the latter. My use of "scale" isn't related to 55nm vs 65nm. It's in the context of RV635 vs RV670, for example.

Edit: Actually I think I missed no-X's question completely. He was asking about G94->G92 scaling vs RV635->RV670 scaling. I don't think you can draw any conclusions about one by looking at the other.
 
This talk of an ATI PPU on the next GPU is the make or break for AMD's future, I believe. Trust me, this whole PPU news threw me a curveball: how is ATI going to support it? Not through Havok, and PhysX is Nvidia's. Unless it's a MADD MADD WORLD; in other words, AMD IS FUSING SSE5 INSTRUCTIONS onto the GPU, and updated compilers will now compile for these instructions.

Could it be that AMD is unleashing SSE5 instructions, which are very GPU-like in nature, onto the GPU first, before Bulldozer arrives? This is an incredible move if true.

Think I'm talking crap? Well, you be the judge and read this link on SSE5 instructions, which are extensions to the general x86 instruction set.


THE NEXT STEP

Finally, as we're always watching how the predicted merging of CPUs and GPUs is progressing, this is a notable time. Although AMD is not targeting GPUs specifically with SSE5, as we mentioned before GPUs are particularly fast at MADD operations and meanwhile SSE5 is providing CPUs a shot in the arm in that area. This won't kill (or even significantly maim) the GPU, but it is one less thing that the GPU advantage is shrinking in. As SSE iterations keep coming out and implementing features similar to DSPs and GPUs, they will keep chipping away at their side of the barrier between the CPU and the GPU until very little is left and the two become one.

2008 and beyond will see SSE5 support coming to the GCC compiler, along with AMD's optimized software libraries. AMD is going through great efforts to spur the adoption of SSE5.


Thus the PPU on the next GPU will be accessed not via DirectX but rather via an updated GCC compiler. If this is what happens: brilliant, AMD, just brilliant.

June cannot come soon enough.
 
You're confusing die size reductions based on manufacturing process with the relative increase in die size for adding units. We are discussing the latter. My use of "scale" isn't related to 55nm vs 65nm. It's in the context of RV635 vs RV670, for example.

Edit: Actually I think I missed no-X's question completely. He was asking about G94->G92 scaling vs RV635->RV670 scaling. I don't think you can draw any conclusions about one by looking at the other.
I was talking about G94->G92 scaling vs. theoretical RV670->RV770 (rumoured 480 ALUs / 32 TMUs) scaling.

G94->G92 = ALUs +100%, TFs +100%, TAs +100%, ROPs and MC +0% ~ die space +28% / transistors +49%

RV670->RV770 = ALUs +50%, TFs + 100%, TA + 0%(?), ROPs and MC +0% ~ die space + 32%.

As we expect a 7xx MHz clock speed, I'd say that RV770's design was targeted at higher transistor density rather than higher clock speed (another, more frenetic theory could be a 1.5x-clocked shader domain, which would give a nice 4.5:1 ALU:TEX ratio when using 480 ALUs and 32 texture units; it would explain the ~1GHz rumours, but I doubt it...)
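
A rough sketch of where a 4.5:1 figure can come from. It assumes the ALUs are counted in R6xx-style 5-wide shader units, which the post does not spell out, and it takes RV670 as 320 ALUs / 16 texture units, inferred from the rumoured "+50% ALUs" and "double the TMUs" landing on 480 / 32. The function name is hypothetical.

```python
# Assumption: ALUs are grouped into 5-wide shader units (R6xx-style),
# and the ALU:TEX ratio is counted per shader unit, not per scalar ALU.
ALUS_PER_SHADER_UNIT = 5


def alu_tex_ratio(alus: int, texture_units: int,
                  shader_clock_multiplier: float = 1.0) -> float:
    """Shader units per texture unit, scaled by a relative shader clock."""
    shader_units = alus / ALUS_PER_SHADER_UNIT
    return shader_units / texture_units * shader_clock_multiplier


rv670 = alu_tex_ratio(320, 16)                 # inferred current part
rv770_same_clock = alu_tex_ratio(480, 32)      # rumoured unit counts
rv770_fast_shaders = alu_tex_ratio(480, 32, 1.5)  # 1.5x shader domain theory

print(rv670, rv770_same_clock, rv770_fast_shaders)
```

Under these assumptions the rumoured configuration drops from 4:1 to 3:1 at a common clock, and only the 1.5x shader domain brings it up to 4.5:1.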
 
G92, in comparison to G94, doubles both TMU and ALU count, and the die-space difference is still lower than ~30%. Does nVidia employ magicians? :smile:

According to my measurements (sliding calliper), G92 is about 36% bigger than G94 in terms of die space.

Magic? No, but in contrast to AMD, Nvidia did scale more of their "cheaper" units, leaving the expensive ones alone.

AMD: Fat TMUs (remember: single-cycle FP64 including corresponding datapaths, additional 4 TAs per Quad-TMU), although I'm not sure whether the L2 is integrated into the TMU block.

Nvidia: Fat ROPs (8z/clk, etc.), semi-fat ALU-Clusters.
 
I was talking about G94->G92 scaling vs. theoretical RV670->RV770 (rumoured 480 ALUs / 32 TMUs) scaling.

G94->G92 = ALUs +100%, TFs +100%, TAs +100%, ROPs and MC +0% ~ die space +28% / transistors +49%

RV670->RV770 = ALUs +50%, TFs + 100%, TA + 0%(?), ROPs and MC +0% ~ die space + 32%.

As we expect a 7xx MHz clock speed, I'd say that RV770's design was targeted at higher transistor density rather than higher clock speed (another, more frenetic theory could be a 1.5x-clocked shader domain, which would give a nice 4.5:1 ALU:TEX ratio when using 480 ALUs and 32 texture units; it would explain the ~1GHz rumours, but I doubt it...)

Oh, ok. But does that comparison tell us anything? We already know that the ROP, TMU and shader hardware are considerably different. And there may also be some other goodies added to RV770 in addition to the increased unit counts. TAs on RV770 should be +100% too; I doubt AMD would go for just free AF, they need more raw bilinear as well.
 
RV670->RV770 = ALUs +50%, TFs + 100%, TA + 0%(?), ROPs and MC +0% ~ die space + 32%.
Increasing TF without increasing TA won't improve performance. You can't use texels in filtering until you've worked out which texels you need to fetch. EDIT: should qualify that's for bilinear. I suppose TA:TF = 1:2 could have a similar effect as in G80. Erm...

Also I think Z fillrate is the top priority. That could be achieved with a doubling of RBEs or by increasing the Z-specific portions of the existing RBE.

As we expect a 7xx MHz clock speed, I'd say that RV770's design was targeted at higher transistor density rather than higher clock speed (another, more frenetic theory could be a 1.5x-clocked shader domain, which would give a nice 4.5:1 ALU:TEX ratio when using 480 ALUs and 32 texture units; it would explain the ~1GHz rumours, but I doubt it...)
Because both ALUs and TUs access the register file directly (as far as I can tell) I think they have to be clocked "synchronously".

Jawed
 
donanimhaber.com mentioned "... high performance anisotropic filtering". Increasing the number of both texturing units and ALUs shouldn't change the relative performance drop of AF. But that is not what I meant. RV670 has 32 addressing units. Would it be impossible to share them to feed 32 bilinear texture filters?
 