R4xx will break Moore's law

Could be wrong but I always thought that Moore's law dealt with computational power, and has since been modified in the retelling to mean "number of transistors"...?

In terms of computational power Moore's law has long since been broken, radically. Eh, can't recall...;)

I think what also has to be remembered is that this press release comes out of Synopsys and not ATi. Companies like this love to blow their own horns. I recall years ago a similar company doing work for 3dfx put out a press release about working on .15 micron designs for 3dfx, and the release was circulated with the interpretation that it meant 3dfx was going to "skip" .18 and move right into .15. Further analysis indicated the remarks were predicated on "some .15 micron libraries" 3dfx was looking at. So probably a grain of salt is indicated...;)
 
I think there is too much concentration on the pixel processing pipeline, and not enough consideration of the "glue" elements, when considering the R420 and what it might be capable of. Just because the pixel pipeline architecture remains similar, doesn't mean that what is done with the pixel processing functionality has to remain the same. That even seems to fit exactly the NV20->NV30 metaphor, though the success of the implementation, and how ambitious it is required to be on a technical level, are things we (well, most of us) can't really guess at right now.

The pipeline (or at least part of it) seems sufficiently open-ended in design, though the "real-time" several-hundred-instruction shader mentioned as a demonstration of the F-Buffer is the only actual indication I know of that this might be an accurate description. For a while now, my own theories have revolved around an unfettered (perhaps due to ignorance) imagination with regard to the possibilities offered by taking an F-Buffer approach, and what that means for more than just pixel shaders...IOW, wondering what a good design team could do (and has already done) with regard to overcoming hurdles of latency, scheduling, and flexibility in a data-flow-management-based solution to expanding the featureset.

What could be accomplished by using how the F-Buffer implementation problem (efficiency, speed) was solved, to implement new solutions? Why would any solutions necessarily require "many" additional transistors to offer important functionality? Brute force isn't the only type of engineering solution, just the easiest one to come up with.

Of course, ATI may just invent some other unexpected/unrelated solution (something I wouldn't put past them given my outlook on their engineering record), but I'm limited by the information I have access to and what it allows me to envision. How that relates to actuality will be interesting to see.

As for other IHVs:

I'm really curious about PowerVR, too. nVidia, not quite so much...not because I'm "counting them out", but because I think they've indicated their path to evolution pretty clearly (process, clock speed, bandwidth, while solving the problems of the NV3x series and developing the not-so-secret capabilities they envision). 3DLabs might have something interesting too, but I don't have enough info to be curious yet..."floating-point-P10 with better performance" seems a pretty safe bet.

Other IHVs don't seem to be able to offer anything surprising in the near future. Well, not as "surprising" or as significant as the above, I think.
 
Dio said:
RussSchultz said:
doubling the size of your chip will (roughly) halve the yield (assuming the yield hit is due to random defects).
This is wrong.

If defects were random across the whole wafer, doubling the size of each die doubles the number of bad chips, while simultaneously halving the number of total chips. This is a non-linear function.

I don't know anything about foundries, of course. I have no idea even if defects are random or not.
You're right. It actually has a worse effect than halving the yield.
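For what it's worth, here's a toy calculation with the textbook Poisson defect-density model (my own sketch; the defect densities are made up and real foundry yield models are more involved than this):

Code:
import math

def poisson_yield(defects_per_cm2, die_area_cm2):
    # Fraction of dies with zero defects under a simple Poisson defect model.
    return math.exp(-defects_per_cm2 * die_area_cm2)

for d in (0.3, 0.6, 1.0):                # hypothetical defect densities, per cm^2
    small = poisson_yield(d, 1.0)        # hypothetical 1 cm^2 die
    big = poisson_yield(d, 2.0)          # same process, die area doubled
    print(f"D={d}: yield {small:.0%} -> {big:.0%} (ratio {big/small:.2f})")

In this toy model doubling the die area squares the yield fraction, so the hit grows the worse the defect density is to begin with - and for a big chip on a young process it can easily be worse than a straight halving.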
 
DaveBaumann said:
Disclaimer:
Anyway, I'm wondering if R420 isn't sounding somewhat ambitious at this point. A new 200M transistor chip done in a year by the same teams as the previous part (9800) - sounds like a tall order.

Didn't ATI say they have different teams working on each new architecture? Or was that just for the R500?
 
WaltC said:
Could be wrong but I always thought that Moore's law dealt with computational power, and has since been modified in the retelling to mean "number of transistors"...?
you've got that backwards ;)

Gordon Moore made his famous observation in 1965, just four years after the first planar integrated circuit was discovered. The press called it "Moore's Law" and the name has stuck. In his original paper, Moore observed an exponential growth in the number of transistors per integrated circuit and predicted that this trend would continue.

source
 
WaltC said:
Could be wrong but I always thought that Moore's law dealt with computational power, and has since been modified in the retelling to mean "number of transistors"...?
No, Moore's "Law" (How in the world can you call an observation of the progression of engineering a Law? Bah...) originally dealt with an increase in transistor density (doubling about every 18 months). Its meaning has been colloquially modified to mean that computational power will increase by about the same amount.

But Moore's "Law" deals more with process technologies than it deals with processors themselves. Processors cannot "break" this imaginary law. You'd need a radical change in processor design to do that (Which will happen within about four and a half years...for better or for worse).

Anyway, if ATI's new processor has much more than 250 million transistors, as this rumor seems to want to imply, then it will be because it's quite a bit larger than ATI's R3xx cores. That means the yields will be lower.
 
DaveBaumann said:
Some people believe the extra trannies will come from increased width of the pipelines, but I get the impression that there will be an increased depth in the pixel shaders - i.e. an 8x1x2 configuration (pipes x texture samplers x pixel shader units). I'd also guess that there will be an optimised stencil rendering path per pipeline as well.

OK, let's do some more "scientific speculation":

First, we'll have to lay out some assumptions. Relatively speaking:

1) Pixel shading pipes are high in transistor count, and low in bandwidth consumption

2) Texel reading units are low in transistor count, and high in bandwidth consumption

3) Pixel writing pipes are medium in both respects.

This is a bit fuzzy of course, but it should suffice.

So now let's rank several possible configurations, once based on transistor count, and once based on bandwidth consumption per clock:

Code:
Transistor count (low to high)       Bandwidth Requirements (low to high)
--------------------------------------------------------------------------------
1) 8x2x1                             A) 8x1x2
2) 8x1x2                             B) 12x1x1
3) 12x1x1                            C) 12x1x2
4) 8x2x2                             D) 8x2x1 
5) 12x2x1                            E) 8x2x2 
6) 16x1x1                            F) 16x1x1 
7) 12x1x2                            G) 12x2x1

(Just to be clear on nomenclature, when I say 8x2x2, I take it to mean a total of 8 pixel writing pipes (can write 8 pixels per clock), each of which has 2 texture sample units, and 2 pixel shading units. Thus, 8x2x2 = a total of 16 texture reading units, and 16 pixel shading units.)
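One crude way to sanity-check those two rankings is to put rough per-clock numbers on the assumptions. The weights below are purely illustrative guesses keyed to assumptions 1-3 above (shader units: lots of transistors, no bandwidth; TMUs: the reverse; pixel-writing pipes: middling on both), so treat this as a sketch rather than anything rigorous:

Code:
CONFIGS = ["8x2x1", "8x1x2", "12x1x1", "8x2x2", "12x2x1", "16x1x1", "12x1x2"]

def score(config):
    pipes, tmus_per_pipe, shaders_per_pipe = map(int, config.split("x"))
    tmus = pipes * tmus_per_pipe
    shaders = pipes * shaders_per_pipe
    transistors = 3 * shaders + 2 * pipes + 1 * tmus  # hypothetical cost weights
    bandwidth = 3 * tmus + 2 * pipes                  # texel reads + pixel writes per clock
    return transistors, bandwidth

for c in sorted(CONFIGS, key=lambda cfg: score(cfg)[0]):
    t, b = score(c)
    print(f"{c:>7}: transistor score {t:3d}, bandwidth score {b:3d}")

Sorting on either score reproduces the two orderings above (give or take a few ties), but the weights are guesses, so don't read too much into the absolute numbers.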

One more assumption we need: what clock speed do we expect a 0.13u part from ATI to run at, given 200 million transistors?

Even given ATI's favorable track record in this respect, I think that assuming they will hit 380 Mhz on 0.13 with 200 million transistors is very optimistic for their first gen loki. I would tone it down to about 300 Mhz.

Next, we have to guess at what type of memory is going to be readily available and usable for a "high end" part in Q1 04. I dunno 600-650 Mhz DDR/(G)DDR II?

That would put bandwidth targets at roughly 1.8x that of the 9800 Pro.
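For reference, the ~1.8x works out like this, assuming the new part keeps a 256-bit bus like the 9800 Pro (that bus width is just my assumption):

Code:
BUS_BYTES = 256 // 8                      # 256-bit bus = 32 bytes per transfer

def bandwidth_gb_s(mem_clock_mhz):
    return mem_clock_mhz * 2 * BUS_BYTES / 1000.0   # DDR: two transfers per clock

r9800pro = bandwidth_gb_s(340)            # 9800 Pro: 340 MHz DDR -> ~21.8 GB/s
for clock in (600, 650):                  # speculated Q1 04 memory clocks
    target = bandwidth_gb_s(clock)
    print(f"{clock} MHz DDR: {target:.1f} GB/s = {target / r9800pro:.2f}x the 9800 Pro")

That comes out to roughly 38-42 GB/s, i.e. about 1.8-1.9x.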

So let's look at some of the configurations....

Given this, I think that 8x1x2 is unlikely. It seems like too much silicon, and not enough fill rate, given the available bandwidth.

I also reject 12x1x2 on similar grounds: though there's more bandwidth utilization, it's also considerably more expensive.

I will reject 8x2x1...for the opposite reason: good bandwidth utilization, but not enough silicon to make 200 million trannies IMO. This would also be going intuitively in the wrong direction wrt pixel shading performance.

Next, I'll reject 12x2x1: Only moderate pixel shading gains on one hand, and likely too much fill rate for the given bandwidth on the other. It's just too unbalanced.

So, after tossing out the obvious losers, we're left with:

3) 12x1x1 (Or the "bolt three RV350s together" option).
4) 8x2x2 (The "Take an R350 and double the TMUs and Pixel shading units" option)
6) 16x1x1 (The "bolt 2 R350s together" option...or "bolt 4 RV350s together" option).

Each of these options has pros and cons.

Option 3) (12x1x1). Cons: Non-power-of-two number of pipes, and underutilizing potential bandwidth. Pros: already have RV350 on 0.13u. In short: perhaps the least costly and risky part to R&D and manufacture, but with a lower performance return relative to other solutions.

Option 4) (8x2x2). Cons: Requires significant changes to either R350 or RV350 design. Pros: probably the best combination of bandwidth utilization and pixel shading performance increase. Most balanced part, IMO, but probably the most expensive to R&D.

Option 6) (16x1x1). Cons: Silicon cost...most expensive in terms of transistor real estate, and may require more bandwidth than is available to fully utilize the fill rate. Pros: Given enough bandwidth, performance will be top notch in all aspects.

There you have it. :) I'm not sure which of those three options I'm favoring at the moment....I'll have to think about it....
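Just to put some very rough numbers behind those three (using my guesses above of ~300 Mhz core and ~1.8x the 9800 Pro's bandwidth, call it ~39 GB/s; the "bytes per pixel" figure is simply bandwidth divided by peak pixel fill, ignoring texture traffic, so it's a crude framing device and nothing more):

Code:
CORE_MHZ = 300                   # my clock speed guess from above
BANDWIDTH_GB_S = 39.0            # ~1.8x 9800 Pro, per the earlier guess

# (pixels written, texels read) per clock for each finalist
finalists = {"12x1x1": (12, 12), "8x2x2": (8, 16), "16x1x1": (16, 16)}

for name, (pixels, texels) in finalists.items():
    pixel_fill = pixels * CORE_MHZ / 1000.0   # Gpixels/s
    texel_fill = texels * CORE_MHZ / 1000.0   # Gtexels/s
    budget = BANDWIDTH_GB_S / pixel_fill      # bytes of bandwidth per pixel written
    print(f"{name}: {pixel_fill:.1f} Gpix/s, {texel_fill:.1f} Gtex/s, ~{budget:.0f} bytes/pixel")

Whether ~39 GB/s is enough for those fill rates obviously depends on how many bytes a pixel ends up costing (AA, blending, texture compression, Z tricks and so on), so this only frames the trade-off; it doesn't settle it.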
 
If they're sticking to DX9, would they consider putting 2 R350s on a single chip? With some modification, maybe?
 
Hi Joe,

I remember some rumor somewhere that said that the r420 would have twice the pixel shading power of the r300. That would, if true, count out option 3, but still leave options 4 and 6...

And would option 4 really take up ~100 trannies?
 
But Moore's "Law" deals more with process technologies than it deals with processors themselves. Processors cannot "break" this imaginary law.

Well, 1B+ transistor processors could be here by early 2005 (about 20 months from now...). Would that break it, or would it be in line with what's expected?
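Quick back-of-envelope (my own numbers, so take them for what they're worth): Itanium 2 "Madison" is around 410M transistors in mid-2003, and the colloquial "doubles every 18 months" rule over 20 months gives:

Code:
baseline_m = 410                            # Itanium 2 "Madison", mid-2003 (assumed baseline, in millions)
months = 20
projected_m = baseline_m * 2 ** (months / 18)
print(f"~{projected_m:.0f}M transistors")   # ~886M

So 1B+ by early 2005 would be a bit ahead of that trend line for a single chip, if the baseline is right.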
 
Joe,
I think the bandwidth utilization criterion is good for evaluating design usefulness. Some solid info on memory technologies seems one way to narrow things down, assuming ATI's commitment to balanced design.

I have some doubts on your cost estimates, however, and some of the possibilities you didn't include. For instance, how would reconfiguration capabilities influence your evaluations of transistor count? Also, what if ATI's idea of "balance" has changed...what about an increased emphasis on processing, and the impact of that on the importance of bandwidth utilization in the equation?

I agree that 8x2x1 seems unlikely (unless it is one of multiple behavior descriptions), but I think 12x1x2, 12x2x1, and 8x1x2 were rejected too early, at least based on the stated reasons. I don't think core clock speed and transistor limitations can be estimated so definitively from the reasons provided, though there might, of course, be others.
Greater speed in non-pixel-shaded output doesn't seem that important, especially with improved AA and AF implementations possibly being offered as well. Some flexibility in texture fetching capability seems to go hand in hand with enhanced shader functionality (including vertex shaders), and is also one rumor your list doesn't seem to consider.

However, your post seems the best place to move forward from that I've seen offered on the topic so far, and a pretty "durned" significant step in this speculation thread, IMO. :)

EDIT: I abused your nomenclature (adding pixel processing in the middle makes more sense to me for some reason)...edited to fit pipes x TMUs x processing units, as fits your discussion.
 
Joe: My "guesses ;)" are:

R420: 12x1x1 ( 3 RV350s bolted together - for the pixel pipelines, that is, not for much other stuff )

NV40: 8x0x2 or 8x2x1. Being able to do 16x0x0 is unlikely too. Why? Because as shown by Dave, that trick no longer works as soon as you put in 4x FSAA. And even if they doubled the number of Z units, it wouldn't work with 8x FSAA, which the NV40 is likely to support and focus on ( hopefully better than the NV30 focused on 4x FSAA ;) )

Wouldn't be surprised if low-end NV4x could double zixel output like the NV3x though, since they most likely will be more focused on 4x FSAA, maybe even 2x FSAA for the extreme low-end. But that depends on nVidia's priorities. But if they allowed double single texturing for the NV31 & NV34, I don't see why they wouldn't allow double Z for the NV41, NV42 and NV43 ( although the NV41 might be still slightly too high-end for that - but we'll see )


Uttar
 
Joe DeFuria said:
Code:
Transistor count (low to high)       Bandwidth Requirements (low to high)
--------------------------------------------------------------------------------
1) 8x2x1                             A) 8x1x2
2) 8x1x2                             B) 12x1x1
3) 12x1x1                            C) 12x1x2
4) 8x2x2                             D) 12x2x1
5) 12x2x1                            E) 8x2x1
6) 16x1x1                            F) 8x2x2
7) 12x1x2                            G) 16x1x1

(Just to be clear on nomenclature, when I say 8x2x2, I take it to mean a total of 8 pixel writing pipes (can write 8 pixels per clock), each of which has 2 texture sample units, and 2 pixel shading units. Thus, 8x2x2 = a total of 16 texture reading units, and 16 pixel shading units.)
Interesting table... but I don't see why a 12x2x1 should take less bandwidth than an 8x2x1... I think you need to reverse D and E.
 
McElvis said:
Hi Joe,

I remember some rumor somewhere that said that the r420 would have twice the pixel shading power of the r300. That would, if true, count out option 3, but still leave options 4 and 6...

Anything's possible. ;) However, we have to be very leery of "general" statements like "twice the pixel shading power" of the R300. Specifically, is the claim "twice the power" clock for clock? Or is it just "twice the power?"

For example, my clock speed guess might be completely off. A 12x1x1 at 400 Mhz has exactly twice the "pixel shading power" of a 300 Mhz R300.

Or, the pixel pipelines might be deeper. That is, 1 R420 pixel pipe might be clock for clock 1.33 times "faster" than a R300 pixel pipe. So that a 12x1x1 running at 300 Mhz is indeed twice as fast as an R300 running at 300 Mhz.
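In crude terms (treating "pixel shading power" as just pipes x shader units x clock, times any clock-for-clock speed-up per pipe - obviously a big simplification):

Code:
def shading_power(pipes, units_per_pipe, clock_mhz, per_pipe_speedup=1.0):
    # Very naive metric: peak shader ops issued per second, in relative units.
    return pipes * units_per_pipe * clock_mhz * per_pipe_speedup

r300 = shading_power(8, 1, 300)
print(shading_power(12, 1, 400) / r300)         # 12 pipes at 400 Mhz -> 2.0x
print(shading_power(12, 1, 300, 4 / 3) / r300)  # 12 pipes at 300 Mhz, "deeper" pipes -> 2.0x

Either route gets you to "twice the pixel shading power" without going beyond 12 pipes.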

And would option 4 really take up ~100 [million] trannies?

God only knows. ;)
 
OpenGL guy said:
Interesting table... but I don't see why a 12x2x1 should take less bandwidth than an 8x2x1... I think you need to reverse D and E.

Whoops! :oops: (fixed!) In fact, I did some rearranging of the whole bandwidth ranking table a bit:

Code:
Transistor count (low to high)       Bandwidth Requirements (low to high)
--------------------------------------------------------------------------------
1) 8x2x1                             A) 8x1x2
2) 8x1x2                             B) 12x1x1
3) 12x1x1                            C) 12x1x2
4) 8x2x2                             D) 8x2x1 
5) 12x2x1                            E) 8x2x2 
6) 16x1x1                            F) 16x1x1 
7) 12x1x2                            G) 12x2x1
 
Hmmm. Not sure you can quite analyse bandwidth like that. The worst case bandwidths for 'Nx1x1' and 'Mx2x1' can appear in very different circumstances.

It's very application dependent as to which would consume more bandwidth. For example, if it uses mostly compressed textures - and if it doesn't, it bloody well should! moan mumble whinge complain - the extra bandwidth from an extra texture unit isn't likely to be high....
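To put a (very) rough number on that - my own illustration, and it assumes mip-mapping plus the texture cache keep the average unique-texel footprint near one texel per screen pixel, which is itself an assumption:

Code:
framebuffer_bytes = 4 + 4        # 32-bit colour + 24/8 Z/stencil write per pixel
texel_bytes = {"uncompressed 32-bit": 4.0, "DXT1 (4 bits/texel)": 0.5}

for fmt, per_texel in texel_bytes.items():
    print(f"extra texture layer, {fmt}: ~{per_texel:.1f} B/pixel "
          f"vs ~{framebuffer_bytes} B/pixel of framebuffer traffic")

With DXT1 the extra layer is almost noise next to the colour/Z traffic; uncompressed it's a much bigger chunk - which is the application dependence in a nutshell.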
 