48 shaders vs 32 pipelines?

compres said:
So it would have more ALUs than a G71?

That ALU-count stuff confused me way too often, too. Let's just say that, given the number of instructions R580 can issue, the G7x would need a higher clock frequency than so far predicted (presupposing 8 quads).
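Just to put rough numbers on that (mine, purely illustrative):

```python
# Back-of-the-envelope sketch (my own illustrative numbers, not confirmed
# specs): treat each part as N pixel-shader units issuing one instruction
# per clock, and solve for the clock a 32-unit part would need to match a
# 48-unit part's per-second issue rate.

def matching_clock_mhz(units_a, clock_a_mhz, units_b):
    """Clock (MHz) part B needs so that units_b * clock_b == units_a * clock_a."""
    return units_a * clock_a_mhz / units_b

# e.g. 48 units at an assumed 650 MHz vs a 32-unit part:
print(matching_clock_mhz(48, 650, 32))  # -> 975.0
```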
 
Rys said:
Nope, those mini ALUs are for PS1.4 modifiers and instructions like RSQ, if memory serves. Only the main ALU pairs can issue a MUL (and NV40 can issue MUL from both, too).
It's very confusing in NVidia architectures.

As far as I can tell, you have the dual-issue-capable MAD of the two shader units (or MUL and MAD in NV40).

Then there is the dedicated FP16 normalise ALU.

There appears to be a dedicated RCP ALU too.

Finally, in NV40, the second shader unit's alpha channel (often thought of as the scalar component) can sacrifice its MAD capability in favour of performing a complex scalar operation, which is where RSQ and the like get executed.

I don't know (and doubt) whether G70 can make both its shader units' alpha channels perform complex scalar operations...

I'm basing this on p. 487 of GPU Gems 2, which refers to the 6 series. It's chapter 30, which you can download freely from NVidia.

Jawed
 
@ compres:

Gah, counting ALUs is madness in the first instance. Counting just PS ALUs is nuts in the second. Then we have no idea what G71 is yet, so talking about it in that way, as if it's a certainty, is fairly bonkers. No chip has a 'half' ALU either, just sub-units with differing capabilities.

The only thing worth saying (and even that assumes the NVIDIA chip we're expecting soon has 32 PS ALUs with no internal changes in terms of instruction issue) is that R580 will be able to issue more ADDs and MULs per cycle. I sigh at myself as I type that, too.

The theoretical issue rate talk is disjointed and utterly mental at the moment, since nobody seems to be on the same page from one post to the next (including me).
 
Ailuros said:
given the amount of instructions that can be issued by R580, the G7x would need a higher than so far predicted clock frequency (presupposition one assumes 8 quads).

...if the shader load should indeed be that high and there were no other bottlenecks ;)
 
_xxx_ said:
...if the shader load should indeed be that high and there were no other bottlenecks ;)

That's true, but the theoretical increase in arithmetic efficiency isn't something to just sneeze at, either.
 
N00b said:
Which games will that be? UE3-based ones? S.T.A.L.K.E.R.? I doubt we will see that many shader-heavy games this year.
Extrapolating from past benchmarks and current rumors (R580 @ 700 MHz with 48 shaders scoring 11000+ in 3DMark05; G71 @ 700 MHz with 32 PPs), I don't see how the R580 will be competitive with the G71 if this year's games are comparable to last year's, unless there are other architectural improvements to R580 that we don't know about yet.

I don't see there being any shader-heavy games this year (2006).

So the increased arithmetic power of R580 probably won't have much use in games this year.
 
Are any of the current games shader heavy?

What about COD2 for example?

What are the bottlenecks in some of the popular games at the moment?
 
mjtdevries said:
Are any of the current games shader heavy?

What about COD2 for example?

What are the bottlenecks in some of the popular games at the moment?

If we look at:

http://www.extremetech.com/article2/0,1697,1896689,00.asp

We can see the X1600 XT (with the 5.12 Cats) doing well against the 6800 GS in COD2 and F.E.A.R., which indicates that the lack of texture units is more or less a non-issue in those games (the same goes for 3DMark05).

In the rest of the games the lack of texture units does seem to be a bottleneck.
 
Ailuros said:
I read about the claimed 83 GFLOPS for R520 in some GPGPU experiments and reversed the math to reach 16 x 4 MADDs/clk. So it's on the order of 120-something in reality?

VS:

vec4+scalar (all MADDs) -> 10 FLOPs/ALU x 8 VS x 0.625 GHz
~ 50 GFLOPS

PS:

Primary ALU:

vec3+scalar (all MADDs) -> 8 FLOPs/ALU x 16 PS x 0.625 GHz
~ 80 GFLOPS

Secondary ALU:

vec3+scalar (all non-MADDs) -> 4 FLOPs/ALU x 16 PS x 0.625 GHz
~ 40 GFLOPS

Total X1800 XT @ 625 MHz ~ 170 GFLOPS, 32-bit programmable (peak MADDs, MULs, ADDs)
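If anyone wants to sanity-check that arithmetic, here's a minimal sketch of it in Python, using the clock and ALU counts exactly as quoted above:

```python
# Minimal sketch of the peak-FLOPS arithmetic above for an X1800 XT
# (8 VS / 16 PS @ 625 MHz). FLOPs-per-ALU counts come straight from the
# breakdown: vec4+scalar MADD = 10 FLOPs, vec3+scalar MADD = 8,
# vec3+scalar non-MADD = 4.

CLOCK_GHZ = 0.625

def gflops(flops_per_alu, alu_count, clock_ghz=CLOCK_GHZ):
    """Peak GFLOPS = FLOPs per ALU per clock x ALU count x clock (GHz)."""
    return flops_per_alu * alu_count * clock_ghz

vs           = gflops(10, 8)   # vertex shaders:    50.0 GFLOPS
ps_primary   = gflops(8, 16)   # PS primary ALUs:   80.0 GFLOPS
ps_secondary = gflops(4, 16)   # PS secondary ALUs: 40.0 GFLOPS

print(vs + ps_primary + ps_secondary)  # -> 170.0 GFLOPS peak
```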
 
I would think the benefit of texture units is directly related to the bandwidth available. Maybe ATI doesn't see memory improving enough for their refresh? Pixel shading has always been quite light on bandwidth needs. It just sounds like 48 shaders is a more efficient use of resources in the overall scheme of things.
 
Jaws said:
VS:

vec4+scalar (all MADDs) -> 10 FLOPs/ALU x 8 VS x 0.625 GHz
~ 50 GFLOPS

PS:

Primary ALU:

vec3+scalar (all MADDs) -> 8 FLOPs/ALU x 16 PS x 0.625 GHz
~ 80 GFLOPS

Secondary ALU:

vec3+scalar (all non-MADDs) -> 4 FLOPs/ALU x 16 PS x 0.625 GHz
~ 40 GFLOPS

Total X1800 XT @ 625 MHz ~ 170 GFLOPS, 32-bit programmable (peak MADDs, MULs, ADDs)


That's what I meant, just without the VS.
 
Here's a graph of pixel fill-rate per MFLOP in FEAR, seemingly running with no AA and no AF:

[attached graph: b3d44.gif — pixel fill-rate per MFLOP in FEAR]


It's based solely on pixel shader FLOPs, as I don't think vertex shader FLOPs are particularly relevant. It uses the simplistic "12 FLOPs per ATI pipe, 16 FLOPs per NVidia pipe" view of the world.

I've assumed the GTX-512 is running at 550 MHz.
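For anyone who wants to reproduce the ratio, here's a rough sketch of the arithmetic I think underlies it. The pipe counts and clocks are assumptions, and I'm guessing the fill-rate side is frame rate times rendered pixels; if the graph derives it differently, the code is easily adjusted:

```python
# Sketch of the metric behind the graph: observed pixel throughput divided
# by theoretical PS MFLOPS, using the simplistic per-pipe FLOP counts above.
# Pipe counts and clocks are my assumptions for illustration.

def ps_mflops(pipes, clock_mhz, flops_per_pipe):
    """Theoretical pixel-shader MFLOPS = pipes x FLOPs/pipe x clock (MHz)."""
    return pipes * flops_per_pipe * clock_mhz

x1800xt = ps_mflops(16, 625, 12)   # 120,000 MFLOPS (12 FLOPs per ATI pipe)
gtx512  = ps_mflops(24, 550, 16)   # 211,200 MFLOPS (assumed 550 MHz, 24 pipes)

def fill_per_mflop(fps, width, height, mflops):
    """Pixels drawn per second per theoretical MFLOP: a rough efficiency proxy."""
    return fps * width * height / mflops
```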

The data is sourced from:

http://www.anandtech.com/video/showdoc.aspx?i=2649&p=9

Jawed
 
Am I correct to infer that, since the ATI parts lie above the NVIDIA parts on this chart, their pixel shader ALUs are better utilized?

Why is the XT higher than the XL? Faster memory?
ERK

Edit: Brainstorming to answer my own question... the ATI/NVIDIA difference could be down to a number of things, such as the 16:12 assumption, or the fact that one of the NVIDIA ALUs is busy with texturing. I suppose if one included the texture address calc as an ALU then NVIDIA and ATI would be about even (unless my brain is not working atm).
 
The reason I did no analysis was because I wanted to make the point that comparing FLOPs across architectures is pointless.

Within an architecture, yeah - there's plenty of food for thought there, especially comparing the two architectures' scaling.

I suspect FEAR is backbuffer bandwidth limited, for what it's worth. I'm not convinced there's much use in doing an analysis based on ALUs and FLOPS. Any answer you can come up with in that sense is an answer you'd derive more easily from a synthetic benchmark. And we already know that PS synthetics are useless for predicting game performance.


If you plug the numbers in for the X850 XT (also in that review), it comes out looking really bad:
504.6, 557.7, 581.5, 567.1

Even though the ALU structure is the same between R420 and R520, it's clearly not the entire story in FEAR. So is it the ultra-threaded scheduler? Texture cache efficiency? Memory efficiency?...

Jawed
 
Speaking of complicating analyses, does the SM2 X850 even run the same path/shaders as the other, SM3 cards?

If FEAR is backbuffer b/w limited, does that mean it should fly on Xenos?
 
Pete said:
Speaking of complicating analyses, does the SM2 X850 even run the same path/shaders as the other, SM3 cards?
If FEAR has any SM3 code in it, it's been kept awfully quiet.

If FEAR is backbuffer b/w limited, does that mean it should fly on Xenos?
Hard to know for sure, but Condemned apparently runs very smoothly on XB360 with 4xAA - both games are roughly the same tech - though the resolution on XB360 will be lower. I'm not sure if it's true, but supposedly Condemned defaults to soft shadowing, which is a big drain on the PC.

Jawed
 
The reason I did no analysis was because I wanted to make the point that comparing FLOPs across architectures is pointless.

While that is true, it's not that much different with any other maximum theoretical number. I could say that two GPUs are each equipped with, let's say, 20 GB/s of raw bandwidth; if one is an IMR with no efficient bandwidth-saving capabilities and the other a TBDR, it's obvious which will make better use of the available bandwidth.
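As a toy illustration of that point (the overdraw factor is an arbitrary example value of mine):

```python
# Toy model of the IMR vs TBDR point above: with the same raw bandwidth, an
# IMR spends colour/Z traffic on pixels that are later overdrawn, while an
# idealised TBDR sorts visibility on-chip and touches each visible pixel
# roughly once. Overdraw factor is an assumed example value.

RAW_BW_GBS = 20.0
OVERDRAW   = 3.0   # assumed average scene overdraw

imr_useful_bw  = RAW_BW_GBS / OVERDRAW  # ~6.7 GB/s ends up on visible pixels
tbdr_useful_bw = RAW_BW_GBS             # idealised: nearly all of it is useful

print(imr_useful_bw, tbdr_useful_bw)
```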

I'm trying to find a somewhat reasonable basic measurement/formula for today's and future GPUs, since multitexturing fill-rates are becoming more and more irrelevant.
 
I'm not sure if it's true, but supposedly Condemned defaults to soft shadowing which is a big drain on PC.

It also depends on the implementation of the soft shadows. Compare the performance drop in The Chronicles of Riddick vs. FEAR, for example.
 