DaveBaumann said:
Some people believe the extra trannies will come from increased width of the pipelines, but I get the impression that there will be an increased depth in the pixel shaders - i.e. an 8x1x2 configuration (pipes x texture samplers x pixel shader units). I'd also guess that there will be an optimised stencil rendering path per pipeline as well.
OK, let's do some more "scientific speculation":
First, we'll have to lay out some assumptions.
Relatively speaking:
1) Pixel shading pipes are high transistor count, and low in bandwidth consumption
2) Texel reading units are low in transistor count, and high in bandwidth consumption
3) Pixel writing pipes are medium in both respects.
This is a bit fuzzy of course, but it should suffice.
So now let's rank several possible configurations, once based on transistor count, and once based on bandwidth consumption per clock:
Code:
Transistor count (low to high)      Bandwidth requirements (low to high)
--------------------------------------------------------------------------------
1) 8x2x1                            A) 8x1x2
2) 8x1x2                            B) 12x1x1
3) 12x1x1                           C) 12x1x2
4) 8x2x2                            D) 8x2x1
5) 12x2x1                           E) 8x2x2
6) 16x1x1                           F) 16x1x1
7) 12x1x2                           G) 12x2x1
(Just to be clear on nomenclature, when I say 8x2x2, I take it to mean a total of 8 pixel writing pipes (can write 8 pixels per clock), each of which has 2 texture sample units, and 2 pixel shading units. Thus, 8x2x2 = a total of 16 texture reading units, and 16 pixel shading units.)
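To make the nomenclature and the two rankings concrete, here's a rough Python sketch. It parses a "pipes x texture units x shader units" string, totals up the units, and ranks the seven configurations with a simple weighted sum. The weights are entirely made up by hand (arbitrary units, not measurements of any real chip); I only picked them so that they respect the three assumptions above and happen to reproduce both orderings in the table. The `Config` class, the `rank` helper and the weight names are hypothetical too.

```python
from typing import NamedTuple


class Config(NamedTuple):
    pipes: int    # pixel writing pipes (pixels written per clock)
    tex: int      # texture sample units per pipe
    shaders: int  # pixel shading units per pipe

    @classmethod
    def parse(cls, name: str) -> "Config":
        """Turn a string such as '8x2x2' into Config(8, 2, 2)."""
        pipes, tex, shaders = (int(part) for part in name.split("x"))
        return cls(pipes, tex, shaders)

    @property
    def total_tex(self) -> int:
        return self.pipes * self.tex

    @property
    def total_shaders(self) -> int:
        return self.pipes * self.shaders

    def cost(self, pipe_w: int, tex_w: int, shader_w: int) -> int:
        """Weighted sum: each pipe pays for itself plus its attached units."""
        return self.pipes * (pipe_w + self.tex * tex_w + self.shaders * shader_w)


# HYPOTHETICAL weights, hand-picked for illustration only.
TRANSISTOR_W = dict(pipe_w=4, tex_w=2, shader_w=5)  # shader high, texel unit low
BANDWIDTH_W = dict(pipe_w=8, tex_w=16, shader_w=1)  # texel unit high, shader low

NAMES = ["8x2x1", "8x1x2", "12x1x1", "8x2x2", "12x2x1", "16x1x1", "12x1x2"]


def rank(weights: dict) -> list:
    """Sort the configurations from cheapest to most expensive."""
    return sorted(NAMES, key=lambda n: Config.parse(n).cost(**weights))


example = Config.parse("8x2x2")
print(example.total_tex, example.total_shaders)  # 16 texture units, 16 shader units
print("transistors:", rank(TRANSISTOR_W))
print("bandwidth:  ", rank(BANDWIDTH_W))
```

Other weights that keep "shader > pipe > texel unit" for transistors and "texel unit > pipe > shader" for bandwidth can give a slightly different order, which is about what you'd expect from assumptions this fuzzy.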
One assumption we need is clock speed: what MHz do we expect a 0.13u part from ATI to run at, given 200 million transistors?
Even given ATI's favorable track record in this respect, I think that assuming they will hit 380 MHz on 0.13 with 200 million transistors is very optimistic for their first gen loki. I would tone it down to about 300 MHz.
Next, we have to guess at what type of memory is going to be readily available and usable for a "high end" part in Q1 04. I dunno, 600-650 MHz DDR/(G)DDR II?
That would put bandwidth targets at roughly 1.8x that of the 9800 Pro.
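For anyone who wants to check the 1.8x figure, here's a quick sketch of the arithmetic. It assumes the new part keeps the same 256-bit bus as the 9800 Pro (my assumption, nothing confirmed) and uses the 9800 Pro's 340 MHz memory clock as the baseline. The function name is just for illustration.

```python
def ddr_bandwidth_gbs(mem_clock_mhz: float, bus_width_bits: int = 256) -> float:
    """Peak DDR bandwidth in GB/s: two transfers per clock across the whole bus."""
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * 2 * bytes_per_transfer / 1e9


R9800_PRO_MEM_MHZ = 340  # 9800 Pro memory clock (680 MHz effective)
baseline = ddr_bandwidth_gbs(R9800_PRO_MEM_MHZ)

for guess_mhz in (600, 650):  # the memory clocks guessed above
    bw = ddr_bandwidth_gbs(guess_mhz)
    print(f"{guess_mhz} MHz: {bw:.1f} GB/s, {bw / baseline:.2f}x the 9800 Pro")
```

With a narrower or wider bus the ratio changes accordingly, so treat the 1.8x as a ballpark for the same-bus case only.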
So let's look at some of the configurations....
Given this, I think that 8x1x2 is unlikely. It seems like too much silicon, and not enough fill rate, given the available bandwidth.
I also reject 12x1x2 on similar grounds. Though there's more bandwidth utilization, it's also considerably more expensive.
I will reject 8x2x1...for the opposite reason: good bandwidth utilization, but not enough silicon to make 200 million trannies IMO. This would also be going intuitively in the wrong direction wrt pixel shading performance.
Next, I'll reject 12x2x1: Only moderate pixel shading gains on one hand, and likely too much fill rate for the given bandwidth on the other. It's just too unbalanced.
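To put rough numbers on "fill rate versus bandwidth", here's a small sketch that compares peak fill rates against a 9800 Pro (8 pipes, 1 texture unit each, 380 MHz core). It uses my 300 MHz clock guess and the roughly 1.8x bandwidth guess from above, so everything it prints is speculation stacked on speculation. The helper name and constants are made up for the example.

```python
CORE_MHZ = 300           # guessed core clock for the new part (see above)
BANDWIDTH_SCALE = 1.8    # guessed memory bandwidth relative to the 9800 Pro
R9800_PRO = (8, 1, 380)  # pipes, texture units per pipe, core MHz


def fill_rates(pipes: int, tex_per_pipe: int, core_mhz: int) -> tuple:
    """Return peak (pixel, texel) fill rates in millions per second."""
    pixel = pipes * core_mhz
    texel = pixel * tex_per_pipe
    return pixel, texel


base_pixel, base_texel = fill_rates(*R9800_PRO)

for name in ["8x1x2", "12x1x2", "8x2x1", "12x2x1", "12x1x1", "8x2x2", "16x1x1"]:
    pipes, tex, _shaders = (int(part) for part in name.split("x"))
    pixel, texel = fill_rates(pipes, tex, CORE_MHZ)
    print(
        f"{name:>7}: pixel fill {pixel / base_pixel:.2f}x, "
        f"texel fill {texel / base_texel:.2f}x "
        f"(bandwidth ~{BANDWIDTH_SCALE}x)"
    )
```

Under these guesses 8x1x2 comes out at about 0.79x the 9800 Pro's pixel fill rate while having roughly 1.8x the bandwidth, and 12x2x1 lands at about 2.37x the texel fill rate on that same 1.8x.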
So, after tossing out the obvious losers, we're left with:
3) 12x1x1 (Or the "bolt three RV350's together" option).
4) 8x2x2 (The "Take an R350 and double the TMUs and Pixel shading units" option)
6) 16x1x1 (The "bolt 2 R350s together" option...or "bolt 4 RV350s together" option).
Each of these options has pros and cons.
Option 3) (12x1x1). Cons: Non power of 2 number of pipes, and underutilizing potential bandwidth. Pros: already have RV350 on 0.13u. In short: perhaps the least costly and risky part to R&D and manufacture, but with a lower performance return relative to other solutions.
Option 4) (8x2x2). Cons: Requires significant changes to either R350 or RV350 design. Pros: probably the best combination of bandwidth utilization and pixel shading performance increase. Most balanced part, IMO, but probably the most expensive to R&D.
Option 6) (16x1x1). Cons: Silicon cost...most expensive in terms of transistor real estate, and may require more bandwidth than is available to fully utilize the fill rate. Pros: Given enough bandwidth, performance will be top notch in all aspects.
There you have it.
I'm not sure which of those three options I'm favoring at the moment....I'll have to think about it....