R580 versus RV670

Magnum_Force

Newcomer
I've used the search function and couldn't find a topic with this title, but I'm guessing it has probably been discussed here and there throughout this forum in bits and pieces (some larger than others).

I just want to get a general rundown on the differences between each chip, to understand why things were perhaps designed the way they were.

Feel free to also add in R600 (I think it'll be unavoidable).

*note: I do not possess the technical knowledge that pretty much anyone who frequents this board possesses; most of the stuff I've picked up is from you guys. But take comfort in the fact that it really is a case of Great Teacher / Bad Student if I'm way off base on some things :p

So, let's start with the stuff we know:

R580:

90nm
384m transistors
352mm2 die size
256bit memory bus (8x32bit memory config)
48 Pixel Shader Processors
16 texture units
16 pixel pipelines
16 Z samples (1 per pipe)

R580 in X1900XTX form:

650MHz core clock
775MHz memory clock
10400 Mpixels/sec pixel fillrate
10400 Mtexels/sec texture fillrate
10400 Msamples/sec Z fillrate
20800 Msamples/sec AA sample rate
1.3 Billion Triangles/sec Geometry rate
49.6 GB/sec memory bandwidth


Other notes: R580 has roughly a 20% die size increase over R520 for 3 times more pixel shaders (the register arrays also tripled in size to accommodate this change). R520's transistor count was 321m, so this increase cost approximately 63m transistors.

The 48 pixel shader processors in R580 each consist of 2 ALUs: one that can do 1 Vec3 ADD and 1 scalar ADD, and another that can do 1 Vec3 ADD/MUL/MADD and 1 scalar ADD/MUL/MADD, alongside the branch execution unit.

The R580 also has 8 vertex shaders that can each do a MADD.

This gives the R580 the ability to do 48 ADD instructions and 48 MADD instructions in the pixel shaders, plus 8 more MADD instructions in the vertex shaders, per clock.

At 650MHz that is 426.6 GFLOP/sec (374.4 GFLOP/sec for the pixel shaders alone).
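The arithmetic above can be sketched as a quick back-of-envelope script. The per-clock widths are my reading of the descriptions above; in particular, the Vec4+scalar MADD width for the vertex shaders is an assumption, and it yields 426.4 rather than the 426.6 quoted, presumably a rounding difference:

```python
# Back-of-envelope check of the R580 FLOP figures above.
# Per pixel shader unit: one Vec3+scalar ADD ALU (4 FLOPs/clk) and one
# Vec3+scalar MADD ALU (8 FLOPs/clk, counting a MADD as 2 ops).
# Per vertex shader: an ASSUMED Vec4+scalar MADD (10 FLOPs/clk).
CLOCK_GHZ = 0.650  # X1900XTX core clock

ps_flops_per_clk = 48 * (4 + 8)  # 48 pixel shader units -> 576 FLOPs/clk
vs_flops_per_clk = 8 * 10        # 8 vertex shaders      -> 80 FLOPs/clk

ps_gflops = ps_flops_per_clk * CLOCK_GHZ
total_gflops = (ps_flops_per_clk + vs_flops_per_clk) * CLOCK_GHZ

print(f"Pixel shaders: {ps_gflops:.1f} GFLOPS")    # 374.4
print(f"PS + VS total: {total_gflops:.1f} GFLOPS")  # ~426.4
```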

RV670/R600

55nm/80nm
666m/720m transistors
192mm2/420mm2 Die size
256bit/512bit memory bus (8x32 bit/ 8x64 bit)
320 "Stream processors" (64*5 superscalar VLIW)
16 texture units
16 RBEs/pipelines
32 Z samples (2 per RBE)

R600 in HD2900XT form:

742MHz core clock
825MHz memory clock
11872 Mpixels/sec pixel fillrate
11872 Mtexels/sec texture fillrate
23744 Msamples/sec Z sample rate
47488 Msamples/sec AA sample rate
742 Million Triangles/sec Geometry rate
105.6 GB/sec memory bandwidth


RV670 in HD3870 form:
775MHz core clock
1125MHz memory clock
12400 Mpixels/sec pixel fillrate
12400 Mtexels/sec texture fillrate
24800 Msamples/sec Z sample rate
49600 Msamples/sec AA sample rate
775 Million Triangles/sec Geometry rate
72.0 GB/sec memory bandwidth
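None of the fillrate numbers in these tables are magic; they fall straight out of clock × unit count. A small sketch (the helper `fillrates` is hypothetical; it assumes the AA sample rate is twice the Z rate and that the effective DDR transfer rate is twice the memory clock; note the stated 1125MHz memory clock on a 256-bit bus implies 72 GB/sec):

```python
# Hypothetical helper: derive the spec-table figures from clocks and unit
# counts. Assumes AA sample rate = 2x Z sample rate, and DDR memory
# (effective transfer rate = 2x memory clock).
def fillrates(core_mhz, mem_mhz, pipes, z_per_pipe, bus_bits):
    return {
        "pixel_mpix":  core_mhz * pipes,                    # pixel fillrate
        "z_msamples":  core_mhz * pipes * z_per_pipe,       # Z sample rate
        "aa_msamples": core_mhz * pipes * z_per_pipe * 2,   # AA sample rate
        "bw_gbs":      mem_mhz * 2 * (bus_bits // 8) / 1000,  # bandwidth
    }

# HD3870: 775MHz core, 1125MHz memory, 16 RBEs with 2 Z samples each, 256-bit
hd3870 = fillrates(775, 1125, pipes=16, z_per_pipe=2, bus_bits=256)
print(hd3870)  # pixel 12400, Z 24800, AA 49600, bandwidth 72.0 GB/s

# X1900XTX: 650MHz core, 775MHz memory, 16 pipes with 1 Z sample each
x1900xtx = fillrates(650, 775, pipes=16, z_per_pipe=1, bus_bits=256)
print(x1900xtx)  # pixel 10400, Z 10400, AA 20800, bandwidth 49.6 GB/s
```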



Ok, so that's that.

Let's take a look at RV670 and R580, because they both share a 256bit bus. First, let's compare the transistor counts: 384m transistors versus 666m transistors.

That's a difference of 282 million transistors.

If we look at FLOP performance though, RV670 actually seems to have decreased in performance per clock:

5 MADDs (10 FLOPs) * 64 = 320 MADDs (640 FLOPs) per clock

640 FLOPs * 650MHz (the X1900XTX's core clock) = 416 GFLOP/sec

416 GFLOP/sec versus R580's 426.6 GFLOP/sec.

That's around 10 GFLOPs less than the previous (or previous's previous) generation.
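That per-clock comparison, spelled out (same assumptions as the R580 tally earlier: a MADD counts as 2 FLOPs, and R580's vertex shaders are included at an assumed 10 FLOPs/clock each):

```python
# Per-clock FLOP comparison, both chips normalised to a 650MHz clock.
CLOCK_GHZ = 0.650

r580_flops_per_clk = 48 * 12 + 8 * 10  # PS array + VS array = 656 FLOPs/clk
rv670_flops_per_clk = 64 * 5 * 2       # 64 units x 5-wide MADD = 640 FLOPs/clk

r580_gflops = r580_flops_per_clk * CLOCK_GHZ
rv670_gflops = rv670_flops_per_clk * CLOCK_GHZ

print(f"R580: {r580_gflops:.1f}  RV670: {rv670_gflops:.1f}  "
      f"diff: {r580_gflops - rv670_gflops:.1f} GFLOPS")  # diff ~10.4
```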

Oh dear.

Raw numbers never tell the whole truth though; as we can see from the numerous benchmarks around the web, the RV670 enjoys a performance advantage over R580.

Or does it?!

Tom's Hardware's graphics charts (hardly the be-all and end-all of proof, but let's assume they're at least somewhat accurate) show some rather harsh results.

Take Battlefield 2142. At 1024x768 without AA/AF, the HD3870 enjoys a juicy lead over the X1900XTX, roughly 33fps. The situation changes rapidly once AA and AF are activated though, and at 1920x1200 we see just 1fps difference between the two cards. I think we can assume the difference is not caused by the things RV670 and R580 have in common, given that the X1950XTX is actually ahead of both the HD2900XT and HD3870.

That is only one game though, so let's look at others.

Dark Messiah of Might and Magic shows the advantages of the HD3870. At 1920x1200 with AA and AF, it has a 15fps advantage over the X1900XTX. Curiously though, at 1600x1200 the X1900XTX has the lead. Could the HD3870's advantage here be down to bandwidth? Perhaps...

Moving on to Doom 3 and Oblivion, we see the HD3870 has a lead in both titles when AA/AF is on, at all resolutions. Oblivion is a shader-heavy game, so RV670 seems to be more efficient with its shaders (makes sense, as it is unified), but it could be down to some other part of the architecture.

Link to Tom's Hardware's VGA charts below.

http://www23.tomshardware.com/graphics_2007.html?modelx=33&model1=1060&model2=727&chart=275

*On Tom's Hardware's site, it doesn't tell you if HQ-AF is activated on R580 (it is on by default on RV670), so it does make it hard to judge performance.


So let's see what's changed.

We know the R580 had a hybrid ring bus/crossbar memory system, and the RV670 has a fully fledged ring bus memory controller. Whether this would have increased or decreased the number of transistors I can't say, and I'm not technically educated enough to make a guess. This is where the other forum members come in :D

Looking at the figures, AA sampling has increased with RV670 over R580, as has Z sampling (I have no idea what Z actually does, but that does give some idea of how little I know about GPUs). What seems odd is that I thought RV670 took a bigger hit with AA on than R580 did, although most websites activate both AA and AF together most of the time, making it hard to compare. Even so, the faster clock speed of the HD3870 should give it an advantage in texture power over the X1900XTX.

Ok, well, I think I'll leave it there for now. My neck hurts, it's 1:51am, I'm tired, and I don't know enough about this stuff to contribute more, unfortunately.

So, people of Beyond 3D forum, let us see if we can find out:

)Why, with an increase of 282 million transistors, did RV670/R600 decrease overall GFLOP/sec per clock? Was a unified architecture really that expensive to implement? Was it down to DX10 requirements, or a tradeoff for better performance in other areas?

)Is a unified superscalar architecture superior to a fixed-function Vec/scalar architecture?

Also, how does RV670 have a decent performance advantage over R580 when it has only a marginal advantage in overall floating point power? Does the combination of higher clock rate (thus increasing all associated attributes) and higher memory bandwidth produce a cumulative increase in performance, or can it mostly be attributed to a single thing or a few things?

Anyway, time for some nap time :)
 
Magnum_Force said:
The 48 pixel shader processors in R580 each consist of 2 ALUs: one that can do 1 Vec3 ADD and 1 scalar ADD, and another that can do 1 Vec3 ADD/MUL/MADD and 1 scalar ADD/MUL/MADD, alongside the branch execution unit.

The R580 also has 8 vertex shaders that can each do a MADD.

This gives the R580 the ability to do 48 ADD instructions and 48 MADD instructions in the pixel shaders, plus 8 more MADD instructions in the vertex shaders, per clock.

At 650MHz that is 426.6 GFLOP/sec (374.4 GFLOP/sec for the pixel shaders alone).

If we look at FLOP performance though, RV670 actually seems to have decreased in performance per clock:

5 MADDs (10 FLOPs) * 64 = 320 MADDs (640 FLOPs) per clock

640 FLOPs * 650MHz (the X1900XTX's core clock) = 416 GFLOP/sec

416 GFLOP/sec versus R580's 426.6 GFLOP/sec.

That's around 10 GFLOPs less than the previous (or previous's previous) generation.

Oh dear.
RV670 has far more shader power than R580. Plus, since R580 doesn't have unified shaders, the 8 VS units don't help you process pixels when you are shader limited. If you are vertex processing limited, then R580 gets to use its 8 VS units to full potential, but then RV670 gets to use all of its shaders for a roughly 40x advantage.

If you look around, there are some more detailed descriptions of how R580 works, and you haven't nailed it.
 
I'm sure you just gave an ATI engineer a nice joke for around the office. Also, yay for old-style game engines that barely use shader power. Let's try an Unreal Engine 3 powered game and see what happens :) .
 
Lol, I guess I have above all proved I have no idea what I'm talking about :p.

I guess I started this thread just to see if the decisions they made with R600/RV670 were wise ones. I mean, if they had just "beefed up" the R580 architecture with more shader power, more ROPs, etc., would it have been more powerful (but obviously less versatile) for today's games?

R600 is roughly 2x R580 in transistor count, but when it comes to real-world performance, it rarely beats the X1900XTX in Crossfire (with AA/AF on) when a game is capable of taking advantage of it.

Also, wouldn't using the R580 architecture on the latest process tech (55nm) be better than using RV635? They have the same transistor count, but the performance of R580 stomps all over RV635 (obviously RV635 has features R580 doesn't, so that must be taken into account).

Does anybody have a rough estimate of how much more math power R600/RV670 has over R580?
 
It's easy if you sum the MADD throughput of the PS and VS arrays (for R580), but bear in mind that you are comparing a vector-based architecture to a superscalar one, so it's a somewhat case-dependent comparison.
 
Magnum_Force said:
)Why, with an increase of 282 million transistors, did RV670/R600 decrease overall GFLOP/sec per clock? Was a unified architecture really that expensive to implement? Was it down to DX10 requirements, or a tradeoff for better performance in other areas?

I'm interested in this question too, actually. I do know that a unified architecture requires a lot more logic for thread scheduling, and in R600's case specifically even more so, to keep its vector-based ALU clusters occupied for optimal efficiency. A large number of transistors were probably spent here in the move to a unified arch.

Magnum_Force said:
Is a unified superscalar architecture superior to a fixed-function Vec/scalar architecture?

Yes, no doubt about it. There's a reason both NV's and ATI's parts use this setup. The main advantages are far greater flexibility in terms of the shader code you can write, and the increased efficiency of having shader blocks that can execute on both vertex and pixel data. OpenGL guy sorta covered this one in his post. And on that note...

OpenGL guy said:
RV670 has far more shader power than R580.

What do you mean by this? As Magnum_Force pointed out, if you're just counting pure FLOPs, R580 can actually do more per clock than R600/RV670. Are you just referring to real-world situations, in which a unified arch generally gets shader code through much faster?

Magnum_Force said:
Also, wouldn't using the R580 architecture on the latest process tech (55nm) be better than using RV635? They have the same transistor count, but the performance of R580 stomps all over RV635 (obviously RV635 has features R580 doesn't, so that must be taken into account).

Probably better for consumers, but companies like to have the same feature set across their whole lineup. This way OEMs can say it's DX10 compatible and whatnot while using super cheap cards. It has some other advantages too; for example, if there are tons of low-end DX10 cards out, devs are more likely to target DX10 for future games, pushing the industry as a whole. Although this is sort of a moot point, as these low-end cards tend to be too slow to actually run any DX10 code at playable frame rates.
 
What do you mean by this? As Magnum_Force pointed out, if you're just counting pure FLOPs, R580 can actually do more per clock than R600/RV670. Are you just referring to real-world situations, in which a unified arch generally gets shader code through much faster?
I mean exactly what I said: RV670 has far more shader power than R580. Even if you just consider pixel shader power alone, RV670 > R580.

Take ShaderMark. It's a test that's pretty much completely pixel shader limited. Compare the results of R580 to RV670 and you find that RV670 is typically 35-40% faster, but in some cases 100% faster!
 
OpenGL guy said:
I mean exactly what I said: RV670 has far more shader power than R580. Even if you just consider pixel shader power alone, RV670 > R580.

Take ShaderMark. It's a test that's pretty much completely pixel shader limited. Compare the results of R580 to RV670 and you find that RV670 is typically 35-40% faster, but in some cases 100% faster!

Well, what I was getting at is that if the shader instructions were set up to fully exploit both architectures (in R580's case, a 50/50 split between MADD and ADD instructions), you would see throughput echoing Magnum_Force's numbers on each of the chips.

My point is that saying RV670 has more shader power than R580 isn't really correct. RV670's unified architecture is far more flexible and adaptable to different arrangements of instructions, and hence is better able to keep its ALUs active, giving the performance advantage we see in the real world.
"Effectively more shader power" would be a better choice of words IMO. lol
 
Yeah, power might be a misleading concept, since there's the other big fat input that needs to be considered: efficiency. We are still running the first generation of unified shader architectures, so chips like G80 and R600 paid the startup costs related to high control and scheduling overhead. As these things scale, and the control blocks take up a lower percentage of the available transistor budget, the advantage of unified architectures will be obvious.
 
Yeah, power might be a misleading concept, since there's the other big fat input that needs to be considered: efficiency. We are still running the first generation of unified shader architectures, so chips like G80 and R600 paid the startup costs related to high control and scheduling overhead. As these things scale, and the control blocks take up a lower percentage of the available transistor budget, the advantage of unified architectures will be obvious.

Well, I'd say it's pretty obvious already! Anyway, any idea what sort of transistor budget we're talking about for scheduling on RV670/R600?
 
Well, what I was getting at is that if the shader instructions were set up to fully exploit both architectures (in R580's case, a 50/50 split between MADD and ADD instructions), you would see throughput echoing Magnum_Force's numbers on each of the chips.
And still R580 would be slower than RV670. Using Magnum's numbers:
R580 = 374.4 GFLOP/sec for pixel shaders
RV670 = 416 GFLOP/sec for pixel shaders (at R580 clocks)

If you're fully utilizing the pixel shaders, chances are the vertex shaders aren't fully utilized. The numbers above show that at the same clocks, with the ideal instruction mix for R580, RV670 has a roughly 10% advantage. In reality, the advantage is skewed much more towards RV670.
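That "roughly 10%" figure works out like this (a sketch; the ideal-mix assumption is that every R580 pixel shader ALU issues its full Vec3+scalar pair each clock):

```python
# Pixel-shader-only comparison at equal (650MHz) clocks.
CLOCK_GHZ = 0.650

r580_ps = 48 * 12 * CLOCK_GHZ      # 374.4 GFLOPS, R580 pixel shaders only
rv670_ps = 64 * 5 * 2 * CLOCK_GHZ  # 416.0 GFLOPS, the whole unified array

advantage = rv670_ps / r580_ps - 1  # strictly just over 11%
print(f"RV670 advantage when purely pixel-shader bound: {advantage:.0%}")
```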
My point is that saying RV670 has more shader power than R580 isn't really correct. RV670's unified architecture is far more flexible and adaptable to different arrangements of instructions, and hence is better able to keep its ALUs active, giving the performance advantage we see in the real world.
"Effectively more shader power" would be a better choice of words IMO. lol
Whatever.
 
As for the transistor budget increase, I want to mention the tessellation engine and the improved video engine as clients for part of that. If we don't know how much these took, we don't know what part of the transistor budget was purely dedicated to evolving the 3D architecture in R580 to a comparable one in R600 and beyond.

Even if we knew that, we'd still have to face the redesign necessary to comply with DX10 requirements (and I'm not thinking of a unified architecture, since DX10 didn't actually require the hardware to be unified, as Nvidia claimed loud and clear before revealing G80). So it's a kinda pointless comparison. You'd need to remove the extra parts, then refactor R580 into a chip that could handle DX10, and only then could you compare R600's unified architecture versus your modified R580's.
 
Magnum_Force said:
I have no idea what Z actually does, but that does give some idea of how little I know about GPUs

Don't feel bad about it (but feel free to feel stupid after you find out; I did). I didn't know what Z was either until I stumbled upon this thread.
_________________________________
This was supposed to be my signature but either I'm blind or not yet allowed one so... Whose bright idea was it to remove TV-out from the 780G :devilish:.
 
Magnum steadies himself in his chair, flexes his neck, and then head-butts the desk right where a large Z symbol is painted; he does this repeatedly while saying "D'oh!".
 
Yeah, power might be a misleading concept, since there's the other big fat input that needs to be considered: efficiency. We are still running the first generation of unified shader architectures, so chips like G80 and R600 paid the startup costs related to high control and scheduling overhead. As these things scale, and the control blocks take up a lower percentage of the available transistor budget, the advantage of unified architectures will be obvious.
SECOND generation according to ATi. Don't forget the C1!
 