Guru3d: 5700XT & 5900 have "8 (2x4) pixel pipelines"

Dio said:
By god! The coordinates for the Lost City of Atlantis!

ROFL :LOL:

Plato's ramblings weren't any more simple than those actually ;)

----------------------------------------------

IMHO people need to get over that pipeline crap; the sooner the better. Yes arithmetic efficiency is mighty important and is going to scale even more with next generation VPUs, yet I'm afraid it's going to be hard to define on numbers that are probably readable only in the back end (in relative terms).

If, on the other hand, you want to calculate e.g. multitexturing fill-rate (do single-texturing fill-rates make much sense anyway these days?), then I don't think a layman like me goes far wrong by multiplying the number of texture ops per clock by the chip's clockspeed.

Even there the result is a theoretical maximum. Real time efficiency of any architecture can be investigated only in applications/games; the significance of paper specs and theoretical best case scenario should be known.
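
To put the "texture ops times clockspeed" rule into numbers, here is a minimal Python sketch. The function name and the example figures (8 texture ops per clock, 400 MHz) are made up for illustration and do not describe any particular chip; the result is only the paper maximum discussed above.

```python
def theoretical_fillrate_mtexels(texture_ops_per_clock: int, core_clock_mhz: float) -> float:
    """Paper-spec multitexturing fill-rate in megatexels per second.

    This is a best-case ceiling: texture ops per clock times core clock.
    """
    return texture_ops_per_clock * core_clock_mhz


# Hypothetical part (illustrative numbers only, not a real chip's specs).
peak = theoretical_fillrate_mtexels(texture_ops_per_clock=8, core_clock_mhz=400.0)
print(f"{peak:.0f} MTexels/s theoretical peak")
```

Anything measured in a real game will land below this figure, which is the point of treating it as a ceiling rather than a prediction.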
 
Dio: :LOL:
Hey, stop complaining, in the case of the NV34 the new system would favor ATI ;) Although in the case of the NV30,NV31,NV35 and NV36... errr... well... :p
Actually that depends if you put in the arithmetic units; that'd be a hugely positive thing for the R3xx since it beats the hell out of the NV3x there (more particularly if you consider register usage; good luck explaining that to the mainstream though, obviously).

Demirug: Never knew that about the combiners. Are you sure it can do A*B+C*D in one cycle? If so, that's pretty nice. But I think your NV35/NV36/NV38 info is kinda inaccurate; it should rather be:

NV35: (8T[FP32]|4A[FP32]#1) + (8MUL[FP32]|4MAD[FP32]|4DP4[FP32]|8MUL[FP16]|8MAD[FP16]|8DP4[FP16])
NV36: (4T[FP32]|2A[FP32]#3) + (8MUL[FP32]|4MAD[FP32]|4DP4[FP32]|8MUL[FP16]|8MAD[FP16]|8DP4[FP16])
NV38: (8T[FP32]|4A[FP32]#1) + (8MUL[FP32]|4MAD[FP32]|4DP4[FP32]|8MUL[FP16]|8MAD[FP16]|8DP4[FP16])

That is, peak MUL/ADD/... throughput is identical between FP16/FP32, but the peak MAD throughput is higher. But I'm not sure for DP4 though...


Uttar
 
Isn't that an 'illusion' because it's a 2-texture/ALU-per-pixel-per-clock part? It seems to me that Radeon 8500 can do the same thing (t = MUL A,B; MAD C,D,t takes 2 cycles but only uses half the pipeline resources).
 
Dio said:
Isn't that an 'illusion' because it's a 2-texture/ALU-per-pixel-per-clock part? It seems to me that Radeon 8500 can do the same thing (t = MUL A,B; MAD C,D,t takes 2 cycles but only uses half the pipeline resources).

Don't think so. The idea is that NVIDIA really got 1ADD and 2MUL mini-units in all of their FX9/FX12 register combiners. Thepkrl's NV30 pipeline architecture tests ( http://www.beyond3d.com/forum/viewtopic.php?t=5150&highlight= ) told us that two *independent* MULs can be done on the NV30 in each register combiner:
Code:
rounds:  1.00  1.00  prog: m         (1:m          ) 
rounds:  1.00  1.00  prog: mm        (1:mm         ) 
rounds:  1.01  1.00  prog: mmm       (1:mmm        ) 
rounds:  2.00  1.00  prog: mmmm      (2:mmm,m      ) 
rounds:  2.00  1.01  prog: mmmmm     (2:mmm,mm     ) 
rounds:  2.01  2.01  prog: mmmmmm    (2:mmm,mmm    ) 
rounds:  3.01  2.00  prog: mmmmmmm   (3:mmm,mmm,m  )

5 FX12 MULs can be done in a single clock cycle in each pipeline.
So that's 5*4 = 20MUL/clock.
1 is from the FP32 unit. And both FX12 units are capable of 2 FX12 MULs/clock. So that's 1+2+2 = 5. Makes sense :)
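
The round counts in the table above are consistent with a simple "ceiling of MULs divided by MULs per round" model. The sketch below is only a sanity check of that reading: it assumes the first column corresponds to 3 MULs per round and the second to 5 MULs per round, which is an interpretation and not something the test output states. The 4 pipelines come from the 5*4 figure above.

```python
import math


def rounds(num_muls: int, muls_per_round: int) -> int:
    """Rounds needed if up to `muls_per_round` independent MULs issue per round."""
    return math.ceil(num_muls / muls_per_round)


# Measured rounds from the table above: {number of MULs: (column 1, column 2)}.
measured = {
    1: (1.00, 1.00),
    2: (1.00, 1.00),
    3: (1.01, 1.00),
    4: (2.00, 1.00),
    5: (2.00, 1.01),
    6: (2.01, 2.01),
    7: (3.01, 2.00),
}

for n, (col1, col2) in measured.items():
    # Assumption: column 1 behaves like 3 MULs/round, column 2 like 5 MULs/round.
    assert rounds(n, 3) == round(col1)
    assert rounds(n, 5) == round(col2)

PIPELINES = 4              # from the 5*4 figure in the post
MULS_PER_PIPE = 1 + 2 + 2  # FP32 unit + two FX12 units at 2 MULs each
print(MULS_PER_PIPE * PIPELINES, "MUL/clock")
```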

I guess that A*B+C*D is logically possible in 1 clock cycle in FX12/FX9 on the NV2x and NV30/NV31/NV33/NV34 then. Just seems I never made the link between those two facts and thought NVIDIA's architecture was so stupid they could only use that capability in super-specific cases :LOL:


Uttar

P.S.: What about the 8500 BTW? It's not capable of this, I assume? Not that the NV35/NV36 are, either, which is the reason they're often slower on a per-clock basis in DX8 applications.
 
Well I assume we're talking about the GF1 and up (everything till the NV35/NV36) since all of these chips used relatively identical register combiners.
Actually, I don't know about the GF1/GF2 - never knew much about those chips. They got register combiners? Why? And how are they exposed?


Uttar
 
Quitch said:
I think places like Beyond3D need a "Everything you wanted to know about graphics cards, but were afraid to ask" page. I'm sure a lot of reviewers write crap because they know crap. Give some of them a free lesson.

Not just for reviewers but for lurkers/newbies like me!

Trying to fathom this thread isn't something that comes easy to those of us who have an above-normal interest in 3D cards but lack the detailed knowledge most of the participants show...

Uttar> Thank you for explaining the diff between M|D and MAD... Now some of these numbers start to make sense... Hopefully I'll get the rest on the 4th read-through...
If not, I guess I'll just try to take out my frustrations in Savage... ;-)
 
Hyp-X said:
Uttar said:
Never knew that about the combiners. Are you sure it can do A*B+C*D in one cycle? If so, that's pretty nice.

Yes register combiners can do that since they were introduced in the GeForce 1.

http://oss.sgi.com/projects/ogl-sample/registry/NV/register_combiners.txt

NV30-34 can do that but not NV35-38. These GPUs have lost the double MUL ability. At least that was the case with the driver available in September.


Regarding this topic: I hate the word zixel and any word invented to justify marketing. What we need (at least I think that's what we need) is the max pixel throughput, the texturing ability and something like a per-pixel lighting fillrate with a fixed overdraw factor. Single texturing fillrate remains very important. If you're playing Colin McRae Rally with a GeForce FX 5950 or with a GeForce FX 5700 Ultra without FSAA, you'll get roughly the same result. Why? They have the same range of single texturing fillrate. Texturing is still very important. And finally, we need to know the performance when doing something more complex. Per-pixel lighting with overdraw is a good way to measure that.
 
Tridam: Don't forget they got EXACTLY the same Vertex Shading throughput too :)
Well, regarding how to define outputs, I'd go with a simple enumeration, including the total ROP count:
NV30: CP=4; P=4; Z=8; ROP=16
NV36: CP=2; P=4; Z=4; ROP=8
R350: CP=8; P=8; Z=8; ROP=16

This should be 100% sufficient to determine the maximum throughput in all "basic" situations; for CP (Complex Pixels; complex doesn't have to mean arithmetic because texturing can take several clock cycles too, of course) you also need much, much more information if you want to determine throughput. Which is where things like Shadermark3 or some other PS testers come in IMO.
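
For anyone who wants to play with that enumeration, here is a minimal Python sketch that stores the three entries above and prints them with a couple of derived per-clock ratios. The class and field names are made up for illustration, and reading P as "simple pixels" and Z as "Z-only outputs" per clock is an assumption; only CP is spelled out in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outputs:
    cp: int   # complex pixels per clock (as defined in the post)
    p: int    # assumed: simple pixels per clock
    z: int    # assumed: Z-only outputs per clock
    rop: int  # total ROP count


# Values copied from the enumeration above.
CHIPS = {
    "NV30": Outputs(cp=4, p=4, z=8, rop=16),
    "NV36": Outputs(cp=2, p=4, z=4, rop=8),
    "R350": Outputs(cp=8, p=8, z=8, rop=16),
}

for name, o in CHIPS.items():
    # Ratios show how much the "basic" rates exceed the complex-pixel rate.
    print(f"{name}: CP={o.cp} P={o.p} Z={o.z} ROP={o.rop} "
          f"P/CP={o.p / o.cp:.1f} Z/CP={o.z / o.cp:.1f}")
```

Multiplying any of these per-clock figures by a core clock gives the corresponding paper maximum, with the same caveat that the CP number says nothing about how many clocks a given shader actually takes.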

Now, another interesting subject of discussion is whether it's viable to test all of this stuff in DirectX or not. While now both NV3x and R3xx drivers should be exposing their capabilities pretty darn well in DirectX, shading-wise at least, I do not believe this was the case at release.

Ideally, you may thus want to test it all in OpenGL proprietary extensions, because the IHV *won't* cheat precision-wise there (such as NVIDIA did by imposing FP16 or even FX12 just about everywhere for the NV30/NV31/NV34 IIRC) and the drivers might be more mature when it comes to interpreting this code. Problem is, that's MUCH more time consuming too, and it requires the benchmark's code to be constantly updated...

Another, more interesting, solution is to include a precision-testing routine in the benchmark. I remember someone showing one on the forums which worked remarkably well (although not with FSAA/AF, which seems pretty normal to me) so it's obviously possible. Regarding shader replacement, you could include IQ tests, but eh...
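
The routine from the forums isn't reproduced here, so the following is only a CPU-side analogue of the idea, assuming the classic "halve epsilon until 1 + eps rounds back to 1" probe. It uses Python's `struct` module to round values to IEEE half (FP16) and single (FP32) precision; a real benchmark would have to run equivalent arithmetic inside a pixel shader and read the result back, and this sketch does not cover fixed-point formats such as FX12.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round a Python float to the storage format `fmt` and back."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def mantissa_bits(fmt: str) -> int:
    """Count explicit mantissa bits by halving eps until 1 + eps/2 rounds to 1."""
    bits = 0
    eps = 1.0
    while round_to(fmt, 1.0 + eps / 2) != 1.0:
        eps /= 2
        bits += 1
    return bits


print("FP16 mantissa bits:", mantissa_bits("<e"))  # prints 10
print("FP32 mantissa bits:", mantissa_bits("<f"))  # prints 23
```

If a driver silently ran an "FP32" shader at FP16, a probe of this kind would report 10 bits instead of 23, which is the sort of discrepancy such a test is meant to catch.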

So I'd say a DirectX-based benchmark application with a precision test included would be pretty near perfect for testing the card's PS performance, except with early driver releases, where the card might not expose its full capabilities in DirectX yet. Not that it matters, since if you buy the card at the time the benchmark is done, that's the performance you'll get in DirectX no matter what!

IMO, some theoretical numbers along with shading numbers so you can understand *why* it's that way can still be interesting. Furthermore, with your per-pixel lighting example, you aren't going to use things like RSQ/COSIN/... - sure, they aren't used super-mega-frequently, but you've got to use at least 10 highly different shaders for it to make any type of sense.

Hyp-X: Thanks for telling me to write the link again, makes sense now ;)
Garibaldi: Well it's obviously possible to explain this in a more simple, although longer way, but that wasn't my goal here :)
Stealth: Nah, just that I already once tried doing a thread on this (although with much less precise ideas) so even though I haven't thought a lot about it each time, I did think about it twice with lots of time in between. :p

Uttar
 
Demirug said:
NV30: (8T[FP32]|4A[FP32]#1) + (8 Reg-Combiners[FX12]#2)
NV31: (4T[FP32]|2A[FP32]#3) + (4 Reg-Combiners[FX12]#2)
NV34: (4T[FP32]|2A[FP32]#3) + (4 Reg-Combiners[FX12]#2)
NV35: (8T[FP32]|4A[FP32]#1) + ((8MUL[FP32]|4MAD[FP32]|4DP4[FP32])|8 Reg-Combiners[FX12]#2)
NV36: (4T[FP32]|2A[FP32]#3) + ((4MUL[FP32]|2MAD[FP32]|2DP4[FP32])|4 Reg-Combiners[FX12]#2)
NV38: (8T[FP32]|4A[FP32]#1) + ((8MUL[FP32]|4MAD[FP32]|4DP4[FP32])|8 Reg-Combiners[FX12]#2)

#1 1/SQRT(X) = 2A[FP32]|4A[FP16]
#2 Reg-Combiner:
RGB/XYZ 2MUL|1MAD|1ADD|2DP|1MUL+1DP|1A*B+C*D
A/W 2MUL|1MAD|1ADD|1A*B+C*D
#3 1/SQRT(X) = 1A[FP32]|2A[FP16]

Oh please, spare me from thy mighty gibberish!

I'm perfectly happy with:

R300, r350,r360 -> 8
nv35,nv38 -> 4


oh I might add:
RVxxx -> who gives a damn
nvxx other than above mentioned -> Spank me my lord but do not dare speak about those abominations. It's horrifying.

add any number of happy faces to make enough of them to not take me serious please :devilish:
 
What?! Just where do you come up with this crap, Mendel--are you ment...
Mendel said:
add any number of happy faces to make enough of them to not take me serious please :devilish:
Oh... my bad. :p ;)
 
Tridam said:
Hyp-X said:
Uttar said:
Never knew that about the combiners. Are you sure it can do A*B+C*D in one cycle? If so, that's pretty nice.

Yes register combiners can do that since they were introduced in the GeForce 1.

http://oss.sgi.com/projects/ogl-sample/registry/NV/register_combiners.txt

NV30-34 can do that but not NV35-38. These GPUs have lost the double MUL ability. At least that was the case with the driver available in September.

Running some tests. It looks like the double MUL ability is still working with NV35-38.
 
Demirug said:
Tridam said:
Hyp-X said:
Uttar said:
Never knew that about the combiners. Are you sure it can do A*B+C*D in one cycle? If so, that's pretty nice.

Yes register combiners can do that since they were introduced in the GeForce 1.

http://oss.sgi.com/projects/ogl-sample/registry/NV/register_combiners.txt

NV30-34 can do that but not NV35-38. These GPUs have lost the double MUL ability. At least that was the case with the driver available in September.

Running some tests. It looks like the double MUL ability is still working with NV35-38.

I'm just finishing another run of tests, but I can't see the double MUL working. I tested with PS1.1, PS1.4, PS2.0 and PS2.0_pp.

How are you testing it?
 
AndrewM said:
Just to make a note, Geforce1/2 have 2 general combiners. Geforce3/4 have 8 GC's.

How does that work then, though?
In the NV2x it's 1x4x2 GCs. In the NV1x, they already work with 4 pipelines (although I'm not sure if they use quads already) - so, how can you have 2 GCs, which is less than 4?

I assume, thus, that the GCs are "outside" and "after" the pixel pipelines? And if you got 4 pixels which can be textured in 1 clock cycle; would the lack of sufficient GCs mean the pixel throughput is halved? Just curious. Unless you're mistaken and it's 4? Just asking, cause I don't quite seem to understand how this would be efficient :)


Uttar
 
Uttar said:
AndrewM said:
Just to make a note, Geforce1/2 have 2 general combiners. Geforce3/4 have 8 GC's.

How does that work then, though?
In the NV2x it's 1x4x2 GCs. In the NV1x, they already work with 4 pipelines (although I'm not sure if they use quads already) - so, how can you have 2 GCs, which is less than 4?

I assume, thus, that the GCs are "outside" and "after" the pixel pipelines? And if you got 4 pixels which can be textured in 1 clock cycle; would the lack of sufficient GCs mean the pixel throughput is halved? Just curious. Unless you're mistaken and it's 4? Just asking, cause I don't quite seem to understand how this would be efficient :)


Uttar

GeForce1/2 definitely only has 2 general combiners.
 
Uttar said:
Sorry for the double post, was only responding to Dig above...
Walt: Problem is, your view on this favors NV31/NV34/NV36. Because they can't get to their maximum of 4 pixels/clock much of the time....

Here's the thing, though, Uttar...the "pixel per clock" number for a g/vpu tells us only what the absolute limit on pixels rendered to screen per clock is for a given architecture. It's not a concern, relative to this statistic, of knowing that in some cases the absolute limit isn't reached. We know that when multitexturing the R3x0 doesn't do "8" pixels per clock, but at most "4," because of its pipeline organization. But when single texturing is involved we can get an absolute limit of 8 pixels per clock rendered to screen with R3x0. This number we can then directly compare to the absolute limit of 4 pixels per clock for nV3x (single or multitexturing, either way.) These are absolute numbers wholly independent of internal ops, and provide us with some "absolutes" about the architectures that in turn allow us to make some basic deductions as to performance. They are descriptive in a fundamental way, in other words.
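
The pixel-per-clock limits in the paragraph above fall out of a very small model. The Python sketch below assumes the commonly cited layouts of 8 pipelines with 1 texture unit each for R3x0 and 4 pipelines with 2 texture units each for nV3x; those layouts are an assumption here, since the paragraph only refers to "pipeline organization", and the function name is made up for illustration.

```python
import math


def pixels_per_clock(pipelines: int, tmus_per_pipeline: int, texture_layers: int) -> float:
    """Upper limit on pixels written per clock for a given number of texture layers.

    Each pipeline needs ceil(layers / TMUs) clocks to texture one pixel.
    """
    clocks_per_pixel = math.ceil(texture_layers / tmus_per_pipeline)
    return pipelines / clocks_per_pixel


# Assumed layouts: R3x0 as 8x1, nV3x as 4x2.
for name, pipes, tmus in (("R3x0", 8, 1), ("nV3x", 4, 2)):
    single = pixels_per_clock(pipes, tmus, texture_layers=1)
    dual = pixels_per_clock(pipes, tmus, texture_layers=2)
    print(f"{name}: single-texturing {single:.0f}/clock, dual-texturing {dual:.0f}/clock")
```

Under those assumptions the model gives 8 and 4 for R3x0 and 4 either way for nV3x, matching the limits described above; it deliberately ignores everything about ops per clock inside the shader.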

Conversely, say that we knew nothing of the pixel pipeline organization relative to a chip. How, then, could we use a peripheral assessment of internal "ops per clock" to ascertain the limit on the number of pixels per clock a given architecture could render to screen? I don't think you could do that, which is why I think an understanding of pixel pipeline organization is fundamental to understanding the architecture a chip is built on. IE, "ops per clock," and "pixels per clock" are oranges and apples, and describe wholly different characteristics of various architectures.
 