Magnum_Force
Newcomer
I've used the search function and couldn't find a topic with this title, but I'm guessing it has probably been discussed here and there throughout this forum in bits and pieces (some larger than others).
I just want to get the general rundown on the differences between each chip, to understand why things were made the way they were.
Feel free to also add in R600 (I think it'll be unavoidable).
*note: I do not possess the technical knowledge that pretty much anyone who frequents this board possesses. Most of the stuff I've picked up is from you guys, but take comfort in the fact it really is a case of Great teacher / Bad student if I'm way off base on some things.
So, let's start with the stuff we know:
R580:
90nm
384m transistors
352mm2 die size
256bit memory bus (8x32bit memory config)
48 Pixel Shader Processors
16 texture units
16 pixel pipelines
16 Z samples (1 per pipe)
R580 in X1900XTX form:
650 MHz core clock
775 MHz memory clock
10400 Mpixels/sec pixel fillrate
10400 Mpixels/sec texture fillrate
10400 Msamples/sec Z sample rate
20800 Msamples/sec AA sample rate
1.3 Billion Triangles/sec Geometry rate
49.6 GB/sec memory bandwidth
Other notes: R580 has a roughly 20% die size increase over R520 for 3 times as many pixel shaders (the register arrays also tripled in size to accommodate this change). R520's transistor count was 321m, so this increase cost approx. 63m transistors.
Each of the 48 pixel shader processors in R580 consists of 2 ALUs, one of which can do 1 Vec3 ADD and 1 scalar ADD, and the other of which can do 1 Vec3 ADD/MUL/MADD and 1 scalar ADD/MUL/MADD, as well as a branch execution unit.
The R580 also has 8 vertex shaders, each of which can do a MADD.
This gives the R580 the ability to do 48 ADD instructions and 48 MADD instructions per clock in the pixel shaders, and 8 more MADD instructions in the vertex shaders.
At 650 MHz that is 426.6 GFLOP/sec (374.4 GFLOP/sec for the pixel shaders).
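For anyone who wants to check the sums, here is a rough sketch of how those GFLOP figures fall out of the unit counts. It counts an ADD as 1 FLOP and a MADD as 2 FLOPs per lane. The vertex shader width is my assumption (Vec4 + scalar, so 5 lanes of MADD), since all that is stated above is that each one "can do a MADD"; the function names are just made up for illustration. Under that assumption the total comes out at 426.4, a shade under the 426.6 quoted above.

```python
# Per-clock FLOP budget for R580, built from the unit counts in the post.
# ASSUMPTION: each vertex shader does a Vec4 + scalar MADD per clock
# (5 lanes). The post only says each one "can do a MADD".

FLOPS_PER_ADD = 1
FLOPS_PER_MADD = 2  # one multiply plus one add


def r580_flops_per_clock(pixel_shaders=48, vertex_shaders=8, vs_lanes=5):
    """Return (pixel shader FLOPs/clock, vertex shader FLOPs/clock)."""
    # ALU 1: Vec3 ADD + scalar ADD -> 4 lanes of ADD
    alu1 = 4 * FLOPS_PER_ADD
    # ALU 2: Vec3 MADD + scalar MADD -> 4 lanes of MADD
    alu2 = 4 * FLOPS_PER_MADD
    pixel = pixel_shaders * (alu1 + alu2)
    vertex = vertex_shaders * vs_lanes * FLOPS_PER_MADD
    return pixel, vertex


def gflops(flops_per_clock, core_mhz):
    """FLOPs per clock times clock in MHz, converted to GFLOP/sec."""
    return flops_per_clock * core_mhz / 1000.0


pixel, vertex = r580_flops_per_clock()
print("pixel shaders only:", gflops(pixel, 650), "GFLOP/sec")
print("pixel + vertex:", gflops(pixel + vertex, 650), "GFLOP/sec")
```

Changing `vs_lanes` shows how much the headline figure depends on what you assume about the vertex shaders; the pixel shader part (374.4) does not move.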
RV670/R600
55nm/80nm
666m/720m transistors
192mm2/420mm2 Die size
256bit/512bit memory bus (8x32 bit/ 8x64 bit)
320 "Stream processors" (64*5 superscalar VLIW)
16 texture units
16 RBEs/pipelines
32 Z samples (2 per RBE)
R600 in HD2900XT form:
742 MHz core clock
825 MHz memory clock
11872 Mpixels/sec pixel fillrate
11872 Mpixels/sec texture fillrate
23744 Msamples/sec Z sample rate
47488 Msamples/sec AA sample rate
742 Million Triangles/sec Geometry rate
105.6 GB/sec memory bandwidth
RV670 in HD3870 form:
775 MHz core clock
1125 MHz memory clock
12400 Mpixels/sec pixel fillrate
12400 Mpixels/sec texture fillrate
24800 Msamples/sec Z sample rate
49600 Msamples/sec AA sample rate
775 Million Triangles/sec Geometry rate
72.0 GB/sec memory bandwidth
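The fillrate and bandwidth numbers in these lists are all just "units times clock", so a small sketch can reproduce them. This assumes the memory moves 2 transfers per memory clock (DDR-style GDDR3/GDDR4) and that every pipe/RBE produces one result per core clock; the function and table names are made up for illustration. For the HD3870, 1125 MHz on a 256bit bus comes out at 72.0 GB/sec by this formula.

```python
# Theoretical peak rates from unit counts and clocks.
# ASSUMPTION: the memory does 2 transfers per memory clock (DDR).


def rate_per_sec(units, core_mhz):
    """Units producing one result per clock -> millions of results/sec."""
    return units * core_mhz


def bandwidth_gb_per_sec(mem_mhz, bus_bits, transfers_per_clock=2):
    """Bus width in bytes times effective transfer rate, in GB/sec."""
    return mem_mhz * transfers_per_clock * (bus_bits / 8) / 1000.0


# name: (core MHz, memory MHz, bus bits, pixel pipes/RBEs, Z samples/clock)
CARDS = {
    "X1900XTX": (650, 775, 256, 16, 16),
    "HD2900XT": (742, 825, 512, 16, 32),
    "HD3870": (775, 1125, 256, 16, 32),
}

for name, (core, mem, bus, pipes, z) in CARDS.items():
    print(
        name,
        rate_per_sec(pipes, core), "Mpixels/sec,",
        rate_per_sec(z, core), "Z Msamples/sec,",
        bandwidth_gb_per_sec(mem, bus), "GB/sec",
    )
```

These are paper maximums only; they say nothing about how often a real game actually keeps every unit busy.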
Ok, so that's that.
Let's take a look at RV670 and R580, because they both share a 256bit bus. Let us first compare the transistor counts: 384m transistors versus 666m transistors.
That's a difference of 282 million transistors.
If we look at FLOP performance though, RV670 seems to have actually decreased in performance per clock:
5 MADDs (10 FLOPs) * 64 = 320 MADDs (640 FLOPs) per clock
640 FLOPs * 650 MHz (X1900XTX's core clock) = 416 GFLOP/sec
416 GFLOP/sec versus R580's 426.6 GFLOP/sec.
Around 10 GFLOPs less than the previous (or previous's previous) generation.
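Here is the same clock-for-clock comparison as a quick sketch, so the clock can be varied. The R580 figure is simply taken as the 426.6 GFLOP/sec quoted above rather than recomputed, and the function name is made up for illustration. It also shows what RV670 manages at the HD3870's own 775 MHz, which is the number that matters for the shipping card.

```python
# Clock-for-clock FLOP comparison, RV670 vs. R580.
# The R580 figure is the one quoted in the post (at 650 MHz), not recomputed.

R580_GFLOPS_AT_650 = 426.6


def rv670_gflops(core_mhz, vliw_units=64, lanes_per_unit=5):
    """64 VLIW units x 5 lanes, each lane doing 1 MADD (2 FLOPs) per clock."""
    flops_per_clock = vliw_units * lanes_per_unit * 2
    return flops_per_clock * core_mhz / 1000.0


same_clock = rv670_gflops(650)   # RV670 slowed to the X1900XTX's clock
stock_clock = rv670_gflops(775)  # RV670 at the HD3870's own clock

print("RV670 @ 650 MHz:", same_clock, "GFLOP/sec")
print("RV670 @ 775 MHz:", stock_clock, "GFLOP/sec")
print("deficit at equal clocks:", round(R580_GFLOPS_AT_650 - same_clock, 1))
print("lead at stock clocks:", round(stock_clock - R580_GFLOPS_AT_650, 1))
```

So the per-clock deficit is real on paper, but the higher shipping clock turns it into a modest lead for the HD3870.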
Oh dear.
Raw numbers never tell the whole truth though. As we can see from the numerous benchmarks around the web, the RV670 enjoys a performance advantage over R580.
Or does it?!
Tom's Hardware's graphics charts (hardly the be-all and end-all of proof, but let's assume they are at least somewhat accurate) show some rather harsh results.
Take Battlefield 2142. At 1024x768 without AA/AF, the HD3870 enjoys a juicy lead over the X1900XTX, roughly 33 fps. The situation changes rapidly once AA and AF are activated though, and at 1920x1200 we can see just 1 fps difference between the 2 cards. I think we can assume the difference is not caused by the things that the RV670 and R580 have in common, due to the fact that the X1950XTX is actually ahead of both HD2900XT and HD3870.
That is only one game though, so let's look at others.
Dark Messiah of Might and Magic shows the advantages of the HD3870. At 1920x1200 with AA and AF, it has a 15 fps advantage over the X1900XTX. Curiously though, at 1600x1200, the X1900XTX has the lead. Could the HD3870's advantage at the higher resolution be down to bandwidth? Perhaps...
Moving on to Doom 3 and Oblivion, we see the HD3870 has a lead in both titles at all resolutions when AA/AF is on. Oblivion is a shader-heavy game, so RV670 seems to be more efficient with its shaders (makes sense, as it is unified), but it could be down to some other part of the architecture.
Link to Tom's Hardware VGA charts below.
http://www23.tomshardware.com/graphics_2007.html?modelx=33&model1=1060&model2=727&chart=275
*Tom's Hardware's site doesn't tell you if HQ-AF is activated on R580 (it is on by default on RV670), so it does make it hard to judge performance.
So let's see what's changed.
We know the R580 had a hybrid Ringbus/Crossbar memory system, and the RV670 has a fully fledged Ringbus memory controller. Whether this would have increased or decreased the number of transistors I can't say, and I'm not technically educated enough to make a guess. This is where the other forum members come in.
Looking at the figures, AA sampling has increased with RV670 over R580, as has Z sampling (I have no idea what Z actually does, which gives you some idea of how little I know about GPUs). What seems odd is that I thought RV670 took a bigger hit with AA on than R580 did, although most websites activate both AA and AF together most of the time, making it hard to compare. Even so, the faster clock speed of the HD3870 should give it an advantage in texture power over X1900XTX.
Ok, well, I think I'll leave it there for now. My neck hurts, it's 1.51am, I'm tired, and unfortunately I don't know enough about this stuff to contribute more.
So, people of Beyond 3D forum, let us see if we can find out:
1) Why, with an increase of 282 million transistors, did RV670/R600 decrease overall GFLOP/sec per clock? Was a unified architecture really that expensive to implement? Was it down to DX10 requirements, or a tradeoff for better performance in other areas?
2) Is a unified superscalar architecture superior to a fixed-function Vec/scalar architecture?
Also, how come RV670 has a decent performance advantage over R580 when it has only a marginal advantage in overall floating point power? Does the combination of higher clock rate (thus increasing all associated attributes) and higher memory bandwidth produce a cumulative increase in performance, or can it be mostly attributed to a single thing or a few things?
Anyway, time for a nap.