So, do we know anything about RV670 yet?

So? Even with the crappy "feature above everything" AA implementation and disproportionately low TMU power...
TMU power is one of those related benefits. High-clocked ALUs = fewer ALUs = more die space for TMUs. And I don't think R6xx has as inadequate texturing power as many sources imply. Performance without AA is quite good in many titles. HD3870, with its 16 TMUs, is expected to be only 10-15% slower than the GF8800GT with 56 of them. Having 3.5 times more texturing units and performing not even 1.5x faster is fairly good proof that RV670 isn't as TMU-limited as many think.
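A quick back-of-the-envelope check of the theoretical bilinear texel fillrates, assuming the commonly quoted clocks (775MHz for the HD3870, 600MHz core for the 8800GT); a rough sketch, not gospel:

```python
# Rough theoretical bilinear texel fillrate: TMUs * core clock.
# Clocks are the commonly quoted ones, treated here as assumptions.
def texel_rate(tmus, clock_mhz):
    return tmus * clock_mhz / 1000.0  # GTexels/s

hd3870 = texel_rate(16, 775)    # ~12.4 GTexels/s
gf8800gt = texel_rate(56, 600)  # ~33.6 GTexels/s
print(f"8800GT/HD3870 fillrate ratio: {gf8800gt / hd3870:.1f}x")  # ~2.7x
```

So even per second (not just per clock) the GT has roughly 2.7x the filtering rate; if the real-world gap is only 10-15%, texturing clearly isn't the main limiter.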

The biggest problem for R600 is its slow resolve and only 2 multi-samples/clock.
 
I hope R700 will see the weaknesses of R600 and not follow the same direction.
I think R700 will end up looking remarkably similar to R600 in terms of the details. ATI has invested a vast amount in the memory bus (internal + external), virtualised resources (everything, this is a big deal), shader-program scheduling and multiple concurrent contexts. All this stuff, I think, only kicks into high gear in a multi-chip GPU - but hey maybe I'm just working backwards from the rumoured "answer".

Anyway, I'm convinced that multi-chip is where ATI was aiming, all along, so you have to read R600 in those terms.

Jawed
 

Agreed!

My concern was for R700's improvements over R600:
A. A separate clock frequency for the Stream Processing Units, decoupled from the GPU clock.
B. Increased texture unit count.
C. Increased ROP count.
D. Instead of 64x5 = 320 streams, 128x3 = 384 streams (similar to ATI R580's 3:1 ratio).
 
To make it simple, there is not much ATI can do with the R6xx series GPUs, just like Nvidia with the GeForce FX (NV30/NV35).

I hope R700 will see the weaknesses of R600 and not follow the same direction.




Actually, the Voodoo5 5000/5500 was not so bad. :)

Actually, the Voodoo5 (which had those lovely 1- or 2-GPU... and kinda 4-GPU... options) was late to market because it was tuned for multi-GPU performance, based on an older, venerable architecture (VSA-100), and was therefore killed by the GeForce 2 instead of its intended competition: the POS Rage Fury Maxx (which I regretfully have somewhere in a closet) and the GeForce two-fiddy-six. IOW, I bring it up because we'll see 1- and 2-GPU units for RV670, and it may already be last-gen performance if Nvidia rolls out a G100 supporting DX10.1 in Q1, which has been the steadfast rumor. R700 might even further this trend... a sad and scary thought, but possible.

As for the other part of your post, I agree. I hope it's not just (essentially) 1/2/3/4xRV670 with a 128/256/384/512-bit bus on 45nm. That would truly be a shame.
My concern was for R700's improvements over R600:
A. A separate clock frequency for the Stream Processing Units, decoupled from the GPU clock.
B. Increased texture unit count.
C. Increased ROP count.
D. Instead of 64x5 = 320 streams, 128x3 = 384 streams (similar to ATI R580's 3:1 ratio).

We can all pray for as much.


Anyway, I'm convinced that multi-chip is where ATI was aiming, all along, so you have to read R600 in those terms.

I thought I was alone in this regard.
 
And where is the data that shows "rather bad utilization in comparison"? (Lest we forget that we can actually access our peak rates as well, where other designs still appear to have "MULs-missing" for much of the time...)
Well, we have shader tests. Digit-life has a pretty decent test array. R600 pulls through with the expected 2.1x performance over the GTS once, but usually it doesn't.

The missing MUL "inefficiency" is a red herring. If NVidia didn't tell us about it, we wouldn't know. For the digit-life tests, I just look at it as 96 SPs at 1.2GHz vs. 320 SPs at 742MHz.
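The 2.1x figure falls straight out of the peak MAD rates; a minimal sketch, counting each MAD as one instruction and using the clocks above:

```python
# Peak MAD issue rate = ALUs * shader clock (each MAD counted as one instruction).
gts_8800 = 96 * 1.2   # GInst/s: 96 SPs at ~1.2GHz
r600 = 320 * 0.742    # GInst/s: 320 ALUs at 742MHz
print(f"R600/GTS peak MAD ratio: {r600 / gts_8800:.2f}x")  # ~2.06x
```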

Now, to be fair, ATI trimmed R600 down by a substantial margin going to the seemingly faster RV670. Even allowing for the 55nm process used, G92 is a lot bigger than RV670, so while R600 was less efficient than G80 from both a transistor/die-space and a shader-op perspective, RV670 has shed the former characteristic.
 
TMU power is one of those related benefits. High-clocked ALUs = fewer ALUs = more die space for TMUs. And I don't think R6xx has as inadequate texturing power as many sources imply. Performance without AA is quite good in many titles. HD3870, with its 16 TMUs, is expected to be only 10-15% slower than the GF8800GT with 56 of them. Having 3.5 times more texturing units and performing not even 1.5x faster is fairly good proof that RV670 isn't as TMU-limited as many think.
http://www.ixbt.com/video3/g80_units2.shtml
Even on G80, some games load the TMUs up to 75% of the time...
 
R600 has a 4:1 ALU to texture ratio, not 5:1 as you seem to be indicating.

Correct!


Regarding the number of shaders, Nvidia has 128 shader units, while the R600 has only 64.

R600 features 64 Shader 4-way SIMD units. But the final output gives 320.
 
Regarding the number of shaders, Nvidia has 128 shader units, while the R600 has only 64.
I don't know how you came up with these numbers. First of all, the number of units in G80 and R600 isn't even directly comparable due to differences in clocking, layout, etc.
R600 features 64 Shader 4-way SIMD units. But the final output gives 320.
How does 64*4 = 320? The chip doesn't work the way you seem to think.

G80's ALUs are more scalar than R600, but R600 has plenty of math power for most tasks.

no-X said:
Having 3.5 times more texturing units and performing not even 1.5x faster is fairly good proof that RV670 isn't as TMU-limited as many think.
I won't comment on RV670 (whatever that is ;) ), but it sure is interesting that G80's "64" texture units don't perform 4x as fast as R600's 16.
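Worth noting that the per-second gap is smaller than the per-clock one anyway, since R600's core clock is higher; stock clocks assumed here:

```python
# Per clock, G80 has 4x the filtering units, but the core-clock gap narrows it.
g80 = 64 * 575 / 1000.0   # ~36.8 GTexels/s bilinear (575MHz assumed)
r600 = 16 * 742 / 1000.0  # ~11.9 GTexels/s (742MHz)
print(f"G80/R600 per-second texel ratio: {g80 / r600:.1f}x")  # ~3.1x, not 4x
```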
 
I don't know how you came up with these numbers. First of all, the number of units in G80 and R600 isn't even directly comparable due to differences in clocking, layout, etc.

How does 64*4 = 320? The chip doesn't work the way you seem to think.

G80's ALUs are more scalar than R600, but R600 has plenty of math power for most tasks.

The problem is how ATI is advertising the HD 2900 XT: there aren't actually 320 stream processors on that chip, there are only 64 real processors, but each is capable of 5 operations per shader clock. The 320 individual stream processing units in R600 are arranged in 4 SIMD arrays of 80, and each functional unit is arranged as a 5-way superscalar shader processor. First, most of the stream processors are simpler and aren't capable of special-function operations. For every block of five stream processors, only one can handle either a special-function operation or a regular floating-point operation. That special-function stream processor is also the only one able to handle integer multiply, while the others can perform simpler integer operations. This means that each of the five stream processors in a block must run instructions from the same thread.
Although the unified shader concept is similar between the two cores, the way they present this functionality is a bit different. Whereas the G80 has 128 aptly named unified shaders, the R600 has 320 stream processors. Clearly 320 is a bigger number than 128, but as we know in the hardware world, bigger numbers don't always mean something is better. The fact of the matter is that stream processors are different from unified shaders. ATI's stream processors are an integral part of the superscalar architecture implemented on the R600. Of those 320 processors, some are standard ALUs and some are special-function ALUs.
In contrast, NVIDIA's G80 has up to 8 groups of 16 (128 total) fully generalized, fully decoupled, scalar stream processors; but keep in mind the SPs in G80 run in a separate clock domain and can be clocked as high as 1.5GHz. In ATI's R600, each functional SP unit can handle 5 scalar floating-point MAD instructions per clock, and one of the five shader processors can handle transcendentals as well. In each shader processor there is also a branch execution unit that handles flow control and conditional operations, and a number of general-purpose registers to store input data, temporary values, and output data.
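To put numbers on the 4+1 arrangement, a minimal sketch of the per-clock issue rates it implies (assuming all five slots can co-issue MADs, as reviews measured):

```python
# Per-clock issue rates implied by the 4+1 arrangement: 64 blocks of 5 units,
# where only the "fat" fifth unit handles special functions and integer multiply.
BLOCKS = 64
print("MADs/clock:", BLOCKS * 5)             # 320: every slot can do a MAD
print("transcendentals/clock:", BLOCKS * 1)  # 64: only the fat unit
print("integer muls/clock:", BLOCKS * 1)     # 64: likewise
```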

TMUs and ROPs hold R600 back because there wasn't much die space left on 80nm tech. The chip uses lots of transistors, which increases its size and complexity; the wafers on which chips are made are fixed in size, so if you have a chip with lots of transistors, it takes up lots of space and you can't make as many of them from one wafer.
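The wafer arithmetic can be sketched directly; the die sizes here are the commonly reported figures, used purely as assumptions:

```python
import math

# Gross dies per 300mm wafer, ignoring edge loss and defects.
# Die sizes are the commonly reported figures (assumptions).
WAFER_AREA = math.pi * (300 / 2) ** 2  # ~70,686 mm^2
for chip, die_mm2 in [("R600 (80nm)", 420), ("G80 (90nm)", 484)]:
    print(chip, "->", int(WAFER_AREA / die_mm2), "gross dies")
```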
 
How does 64*4 = 320? The chip doesn't work the way you seem to think.
Why does all the marketing material say that R600's shader pipes have 5 scalar units each, with one of them capable of special functions? Even the B3D review says they achieved 5 MAD operations per shader pipe per clock.

I won't comment on RV670 (whatever that is ;) ), but it sure is interesting that G80's "64" texture units don't perform 4x as fast as R600's 16.
They do when they're the limiting step. Here are some examples:
http://www.digit-life.com/articles2/video/r600-part2.html

Procedural wood and water shaders are running almost twice as fast on the GTS ("48" texture units at 500MHz).
 
Um, doesn't describing the architecture of an ATI GPU to a senior ATI guy seem a bit redundant? :?:

Why does all the marketing material say that R600's shader pipes have 5 scalar units each, with one of them capable of special functions?
Because it's 64 * (4 + 1 that does special functions) = 320, not 64*4.

TMUs and ROPs hold R600 back because there wasn't much die space left on 80nm tech.
The counter-argument is that ATI's core (i.e. ROPs & TMUs) runs at a higher clock than NV's.
NV has spent transistors on lots of lower-speed TMUs/ROPs and saved on ALU transistors with the much faster ALU clock.
ATI saved on TMUs/ROPs with the higher core clock, but has to use more ALUs because they run at the same core clock.
 
Why does all the marketing material say that R600's shader pipes have 5 scalar units each, with one of them capable of special functions? Even the B3D review says they achieved 5 MAD operations per shader pipe per clock.
He was correcting Shtal, who was the one to say 4-way, unless my eyes deceive me. As you mention, hitting the peak instruction rate (320 inst/clock) on R600 is trivial; I had no trouble whatsoever. But then the shaders for that are pretty trivial too.

That leads on to what, for me, is the hardest challenge the architecture faces running real apps: having the instruction assembler feed it properly.
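To make the "feed it properly" point concrete, here's a toy greedy packer showing how dependencies cap how many of the five slots can be filled per bundle. Purely illustrative; this is not how ATI's assembler actually works:

```python
# Toy VLIW packer: greedily fills 5-wide bundles with ops whose inputs are ready.
def pack(ops, width=5):
    bundles, done = [], set()
    while ops:
        bundle = []
        for op in list(ops):
            name, deps = op
            if deps <= done and len(bundle) < width:
                bundle.append(name)
                ops.remove(op)
        if not bundle:
            break  # unmet dependencies; bail out (toy code)
        done.update(bundle)
        bundles.append(bundle)
    return bundles

# A dependent chain fills one slot per bundle (20% utilization)...
print(pack([("a", set()), ("b", {"a"}), ("c", {"b"})]))  # [['a'], ['b'], ['c']]
# ...while independent scalar ops fill all five slots at once.
print(pack([(str(i), set()) for i in range(5)]))  # [['0', '1', '2', '3', '4']]
```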
 
The best argument, as always, is who runs the latest games fastest at similar or equal levels of quality.

In that case the buyer doesn't really care about clock domains or ROPs.
 
The problem is how ATI is advertising the HD 2900 XT: there aren't actually 320 stream processors on that chip, there are only 64 real processors, but each is capable of 5 operations per shader clock.
So now you're saying it's 64*5, which is better. R600 has 320 ALUs working per engine clock; G80 has 128 ALUs per shader clock. Would that be a reasonable assessment? If you are able to take advantage of 320 ALUs per clock, wouldn't that be something to advertise? In general, we see very good utilization of our ALUs. I don't know why the term "stream processor" was chosen over "ALU" or something else; I am not a marketing guy.

(Much stuff I already know about R600 deleted.)
I am confused by your statements. You insist on comparing 128 to 320, but then you admit that they aren't comparable due to differences in clocking, etc. So why bother to compare them at all?

What is the point you are trying to make about R600 v. G80?
TMUs and ROPs hold R600 back because there wasn't much die space left on 80nm tech. The chip uses lots of transistors, which increases its size and complexity; the wafers on which chips are made are fixed in size, so if you have a chip with lots of transistors, it takes up lots of space and you can't make as many of them from one wafer.
Except that G80 is larger than R600, so you can make even fewer G80s per wafer than R600s. In other words, I don't understand the point of this paragraph.
 
Why does all the marketing material say that R600's shader pipes have 5 scalar units each, with one of them capable of special functions? Even the B3D review says they achieved 5 MAD operations per shader pipe per clock.
And what is the problem? Is it hard to accept that the special function unit can do MADs as well?
 
Well, let's get this back on track...
What kind of improvements can we expect from the 3870 over the 2900? From what's been leaked, is it possible to expect as much as 10 FPS more in certain games? Also (most importantly), will the 3870 be a reasonable upgrade for 2900XT owners, like the GT is for GTX owners?
 
That leads on to what, for me, is the hardest challenge the architecture faces running real apps: having the instruction assembler feed it properly.

In actual game shaders we see the ALU population as being very high.
 
Well, let's get this back on track...
What kind of improvements can we expect from the 3870 over the 2900? From what's been leaked, is it possible to expect as much as 10 FPS more in certain games?

One would expect the story of the 3xxx vs the 2xxx to somewhat mimic that of the GT vs G80: faster in shader-limited cases, slower in bandwidth-limited ones.
 