What can be defined as an ALU exactly?

But Mintmaster, the second ALU already existed on the NV4x as well. The addition was the MADD capability in that ALU. The pipeline change from the NV4x was not that radical; you shouldn't expect 2x the performance.
 
ChrisRay said:
Considering the minimal performance hit for anisotropic filtering in these titles on G70 hardware, the secondary texture unit cannot be stalling that much. I have a feeling that the increased ALU:TEX ratio in most games will likely hide the latency associated with it. I believe Ailuros/Demirug has more information on this particular subject, but the primary ALU will never stall completely. The numbers I seem to recall hearing were in the range of 30-40% in a worst-case scenario. If I'm misquoting anything, those two can feel free to correct me regarding the percentage range.

Chris

Another gentleman from 3DCenter named Coda measured a worst-case scenario of 20%, but that's not small either. Demirug definitely would have more data on it, and my memory of the conversation is pretty vague. If someone wanted to stall it, it shouldn't be too hard IMHO; the question is whether such code would behave that much better on competing products.
 
Mintmaster said:
I don't think so. Just look at per-pipe performance of NV40 versus G70. They're very similar, even with complex lighting shaders. The biggest improvement is 23%, but mostly we're talking about single digit gains. In contrast, for several of these shaders R580 gets 2-3x the performance of R520.

NVidia may have the theoretical capability, but somehow the second ALU doesn't do very much. From the data, a G70 shader pipe is a lot closer to one ATI shader pipe than two.

Anyway, a lot of this talk is moot because the R5xx architecture is nowhere near as dense as G7x. Just wait until the 7600 comes.


http://www.digit-life.com/articles2/video/r580-part2.html
 
Yes, I saw that. What's your point? The G70 scores are for 550MHz, 24 shader pipes and 24 texture units.

In the Cook-Torrence lighting test, G70's shader pipe is 10% faster than R520's.
In the 3-light Blinn test, G70's shader pipe is 32% faster than R520's.
In the parallax mapping test, G70's shader pipe is 5% faster than R520's.
In the frozen glass shader test, G70's shader pipe is 12% slower than R520's.
Steep parallax mapping and fur (PS3.0) were blowouts, naturally.
(All scores are 32-bit. These are the toughest shaders they threw at the cards. The first three show a >2x boost for R580 over R520.)

Clearly the differences are much closer to zero than 100% or even 50%. Counting MADD issue rate is a much poorer gauge of arithmetic performance than multiplying shader pipelines by frequency.
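
For anyone wanting to redo this comparison, here's a quick sketch of the per-pipe, per-clock normalisation (the fps values are placeholders rather than the digit-life scores, and the 625MHz X1800 XT clock is an assumption on my part):

Code:
def per_pipe_per_clock(fps, pipes, clock_mhz):
    # Normalise a shader test score by pipeline count and core clock.
    return fps / (pipes * clock_mhz)

# Placeholder fps values -- substitute the actual test scores.
g70  = per_pipe_per_clock(fps=60.0, pipes=24, clock_mhz=550)  # 7800 GTX 512
r520 = per_pipe_per_clock(fps=55.0, pipes=16, clock_mhz=625)  # X1800 XT (assumed clock)

print("G70 per-pipe advantage: %+.0f%%" % ((g70 / r520 - 1) * 100))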
 
ChrisRay said:
But Mintmaster, the second ALU already existed on the NV4x as well. The addition was the MADD capability in that ALU. The pipeline change from the NV4x was not that radical; you shouldn't expect 2x the performance.
I know. That's why comparing MADD rate is rather silly.
 
Then how do you draw the conclusion that the second ALU doesn't do very much? The additional MADD capability was an improvement to the second ALU, but it has always been there. I don't see how you've come to the conclusion that NVIDIA's second ALU isn't doing much (or I just misread what you were saying), as the shader throughput of the NV4x and G70 has always been fairly exceptional IMO, and the G70 is just more efficient in many cases due to an improvement to the primary ALU.
 
The net result is that G70 features 48 fragment shaders of the same capabilities, with one of them having to handle the texture processing instructions.
Yeah, that's true; one fragment shader/ALU per pixel processor handles the texture processing instructions, but since there are 24 pixel processors on G70 and the units are being counted across the total number of pixel processors available, the statement should read:
The net result is that G70 features 48 fragment shaders of the same capabilities, with 24 of them having to handle the texture processing instructions.
 
Ailuros said:
Moderator's note: This discussion was split from another thread.
Let's continue the discussion about ALUs in a more appropriate forum, and in a more appropriate thread too.




G70 has 48 ALUs.
How dare you :LOL:
Oh, and you should know better than to say the G70 has 48 ALUs... since you could say the R580 has 96 :D
Since I was talking about full ALUs and you knew that, or should have, you should not have an issue with me saying the G70 has 24.
 
Mintmaster said:
Yes, I saw that. What's your point?

My point since my initial post has been that counting ALUs the way most do (depending on perspective) is nonsense without analyzing the deeper aspects of each pipeline, and yes, I know it didn't come across as clearly as I would have wanted.

I myself admit to having initially stepped into the trap of not counting the ADDs on R5x0, until I got several reminders about it and sat down with a friend who analyzed it for me.

G70 was/is definitely no slouch in terms of floating-point performance, especially for the timeframe in which it was released.

Since the cut-out of this thread doesn't include the initial comment I reacted to, here it is once more:

How else do you think they were able to see that they needed a 48 shaderpipeline part while nvidia's still messing around (for future products) with a little 24 shaderpipe part and at best a 32?

Bon appétit, if you look at architectures in that fashion :rolleyes:
 
Luminescent said:
Yeah, that's true; one fragment shader/ALU per pixel processor handles the texture processing instructions, but since there are 24 pixel processors on G70 and the units are being counted across the total number of pixel processors available, the statement should read:

The net result is that G70 features 48 fragment shaders of the same capabilities, with 24 of them having to handle the texture processing instructions.

You're not suggesting that those latter 24 cannot do anything else besides handling texture OPs in parallel, are you?
 
Small reminder: MADD+MADD is very unlikely to ever happen on G70 because of register file restrictions. The original NV40 was (MUL or TEX) + MAD. The primary advantage of G70's MAD+MAD is being able to do single-cycle LERPs like ATI (SUB + MAD), although only when there's no texturing, and having a lot more flexibility when it comes to instruction reordering and Vec2+Vec2/Vec3+Vec1 optimizations (part of the advantages in the G70 pipeline for that last point are, however, afaik unrelated).
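
To spell out the single-cycle LERP point: lerp(a, b, t) = a + t*(b - a), i.e. exactly one SUB and one MAD, so a pipe that can co-issue those two ops in the same clock finishes it in one cycle. A minimal sketch of just the arithmetic (the co-issue scheduling itself obviously isn't modelled here):

Code:
def lerp_as_sub_mad(a, b, t):
    diff = b - a          # SUB -- one issue slot
    return t * diff + a   # MAD -- the other issue slot, same clock per the claim above

assert lerp_as_sub_mad(0.0, 10.0, 0.25) == 2.5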

Personally, I would tend to believe that in games with a 3:1 ALU:TEX ratio, it is a reasonable estimate to say that one of NVIDIA's 24 PS pipelines is equivalent to one of ATI's 48 PS pipelines. This is because NVIDIA's pipelines can do VERY slightly more per clock, and you can roughly imagine the texturing operation every 3 clocks taking that back.

Now, on the other hand, if you decrease the ALU:TEX ratio, NVIDIA's texturing abilities increase while their arithmetic ones decrease, which gives them an obvious advantage. So below that 3:1, you'd conceptualize each of NVIDIA's pipelines as doing more and more compared to ATI's "pipelines", down to the theoretical extreme of 0:1 (pure texturing), where it'd become a (24/16) performance ratio between NVIDIA and ATI (DX7-era games, and some DX8-era ones).

Now, what's more interesting is what happens when the ALU:TEX ratio goes beyond 3:1. Interestingly enough, NVIDIA's ALU1 gets asked to do texture addressing less and less, so their per-pipeline arithmetic power begins to pull further ahead of ATI's. Obviously, they won't reach the equivalent of ATI's 48 pipelines, but perhaps 28-30 quite easily. Which obviously is why NVIDIA doesn't get beaten by 2-2.5x in purely arithmetic tests. Obviously, 3:1 is NVIDIA's weakness, but it gets less dramatic not only below that ratio, but also above it.
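
If you want to put rough numbers on that hand-waving, here's a toy model; everything in it is an assumption for illustration only (pipe/TMU counts are the public specs, "dual_issue" is a made-up factor for how often G70's second ALU actually contributes, texture addressing is assumed to cost one issue slot, and clock differences are ignored):

Code:
def g70_clocks_per_pixel(alu_ops, tex_ops, dual_issue=0.4):
    # 24 pipes; each issues up to (1 + dual_issue) maths ops per clock, the
    # shared ALU also burns one issue slot per texture op for addressing,
    # and each pipe can do at most one texture fetch per clock.
    math_clocks = (alu_ops + tex_ops) / (1.0 + dual_issue)
    return max(math_clocks, tex_ops) / 24.0

def r580_clocks_per_pixel(alu_ops, tex_ops):
    # 48 maths "pipelines" at ~1 op/clock, sharing 16 TMUs.
    return max(alu_ops / 48.0, tex_ops / 16.0)

for ratio in (1, 3, 6, 12):
    g = g70_clocks_per_pixel(ratio, 1)
    r = r580_clocks_per_pixel(ratio, 1)
    print("ALU:TEX %2d:1 -> R580 faster by ~%.2fx (equal clocks)" % (ratio, g / r))

With those made-up constants it lands in the same ballpark as the argument above (rough parity below 3:1, around 2x for R580 at 3:1, shrinking again as the ratio climbs), but don't read anything more into it than that.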

As for what happened to G71, and G73 at the same time: do consider the fact that G70's scheduler was changed compared to NV40's, in order to divide the batches between PS pipelines... Now obviously, that cost them transistors, and the only use we can see for it right now is slightly less disastrous dynamic branching performance. But think about what flexibility that gives you when it comes to future designs...

Sadly, I haven't had any reliable confirmation of this train of thought, so just take it as speculation for the time being.


Uttar
 
Ailuros said:
You're not suggesting that those latter 24 cannot do anything else besides handling texture OPs in parallel, are you?
No, I'm not suggesting this. SU0 can perform a 4-component DIV or MADD, among other combinations, when there are no tex ops in a given clock; however, it is limited during a tex instruction, perhaps because the texture fetch unit has limited connections to the dispatch unit and must use the data pathways of the MADD units. Even so, the MADDs and SFU might be able to modify the texture data before it's sent to the texture fetch unit (assuming my guess about the reason for SU0's limitation during tex ops is correct).

Perhaps a more accurate rendition of the statement would be:
The net result is that G70 features 48 fragment shaders of the same capabilities, with 24 of them required for texture instructions.
The statement does not say that those 24 are completely limited to tex ops, since it is presupposed that they can go beyond tex ops (i.e., if all 48 have the same capabilities, tex modification aside, and are fragment shaders, then they can go beyond tex ops). In addition, it doesn't specify that those 24 are required for tex ops all the time.
 
Uttar said:
Small reminder: MADD+MADD is very unlikely to ever happen on G70 because of register file restrictions.
I disagree. While the four FP32s restriction does make dual-issued FP32 MADDs pretty unlikely (two operands need to be shared by both MADDs), partial precision MADDs seem fairly viable (there are plenty of them in 3DMk06) and won't bust the register bandwidth limit even when all operands are different.
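
A quick back-of-envelope on that register-bandwidth point; the 4-FP32-reads-per-clock limit is the figure under discussion here, and treating an FP16 read as half an FP32 read is my own simplifying assumption:

Code:
FP32_READS_PER_CLOCK = 4   # register-file limit discussed above

def fits(distinct_operands, bytes_per_component):
    # Cost of fetching the operands, expressed in FP32-read equivalents.
    cost = distinct_operands * bytes_per_component / 4.0
    return cost <= FP32_READS_PER_CLOCK

# Two MADDs need 3 operands each, i.e. 6 if none are shared.
print("dual FP32 MADD, all operands distinct:", fits(6, 4))  # False -> two must be shared
print("dual FP32 MADD, two operands shared:  ", fits(4, 4))  # True
print("dual FP16 MADD, all operands distinct:", fits(6, 2))  # True -> _PP dual-issue viable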

The original NV40 was (MUL or TEX) + MAD. The primary advantage of G70's MAD+MAD is being able to do single-cycle LERPs like ATI (SUB + MAD), although only when there's no texturing, and having a lot more flexibility when it comes to instruction reordering and Vec2+Vec2/Vec3+Vec1 optimizations (part of the advantages in the G70 pipeline for that last point are, however, afaik unrelated).
2+2 and 3+1 are features of NV40, too, so it's a fairly subtle advantage in this respect for G7x.

Personally, I would tend to believe that in games with a 3:1 ALU:TEX ratio, it is a reasonable estimate to say that one of NVIDIA's 24 PS pipelines is equivalent to one of ATI's 48 PS pipelines. This is because NVIDIA's pipelines can do VERY slightly more per clock, and you can roughly imagine the texturing operation every 3 clocks taking that back.
I tend to agree, but NVidia's architectures seem to be more sensitive to overall register count - as long as the register count is no more than about 4 or 5 FP32s then they're OK. So they become very much dependent on being able to use _PP to maintain performance. Which seems viable as shorter shaders prolly won't reveal FP16-precision errors.

Now, on the other hand, if you decrease the ALU:TEX ratio, NVIDIA's texturing abilities increase while their arithmetic ones decrease, which gives them an obvious advantage. So below that 3:1, you'd conceptualize each of NVIDIA's pipelines as doing more and more compared to ATI's "pipelines", down to the theoretical extreme of 0:1 (pure texturing), where it'd become a (24/16) performance ratio between NVIDIA and ATI (DX7-era games, and some DX8-era ones).
Yep, I think this is where the heavy advantage for NV40 and G70 fragment pipelines comes from, with so few games having much arithmetic intensity.

Now, what's more interesting is what happens when the ALU:TEX ratio goes beyond 3:1. Interestingly enough, NVIDIA's ALU1 gets asked to do texture addressing less and less, so their per-pipeline arithmetic power begins to pull further ahead of ATI's.
Generally I agree - the NVidia pipeline appears "more flexible", able to gracefully trade texturing and ALU proportions. But I think the true cost transpires in heavy register (and/or FP32 precision) usage.

The only other thing that's worth noting is that the 3:1 ALU:TEX thing has become a little muddled, as far as I can tell. ATI was recommending 3:1 for R420. To me this means that R580 needs about 9:1 to flourish. The 3:1 ratio in R420 seems to be a function of the latency-hiding capability of the fragment pipeline (i.e. thread size), with the partially decoupled texturing providing a fair degree of texture "pre-fetching", though limited by R420's "stalling" upon dependent texturing. With fairly intensive texturing in most games, I think it's fair to say R420 prolly never saw much in the way of 3:1 until, ahem, after R520 had released, and so analysis of this point in respect of R420 hasn't happened...
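
Spelling out the 3:1 -> 9:1 hop, under the assumption that the sweet spot simply scales with how much maths hardware sits behind each texture unit:

Code:
r420_guideline   = 3          # ATI's 3:1 ALU:TEX recommendation for R420, as above
r420_alu_per_tmu = 16 / 16    # 16 maths pipes, 16 TMUs
r580_alu_per_tmu = 48 / 16    # 48 maths "pipelines", 16 TMUs

scale = r580_alu_per_tmu / r420_alu_per_tmu
print("R580 sweet spot ~ %d:1" % (r420_guideline * scale))   # ~9:1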

Well, that's my interpretation, anyway.

Obviously, they won't reach the equivalent of ATI's 48 pipelines, but perhaps 28-30 quite easily. Which obviously is why NVIDIA doesn't get beaten by 2-2.5x in purely arithmetic tests. Obviously, 3:1 is NVIDIA's weakness, but it gets less dramatic not only below that ratio, but also above it.
Truly intense arithmetic tests seem to be all over the shop:

http://www.digit-life.com/articles2/video/3dmark06/3dmark06_11.html

which shows a 35% advantage per fragment pipe for G7x.

The two PS3 tests (Steep Parallax Mapping and Fur) show the opposite, though:

http://www.digit-life.com/articles2/video/r580-part2.html

27% and 18% advantage per pipe in favour of R580 - but they prolly make use of dynamic branching as a performance tweak.

The PS2 tests on that page, Parallax Mapping and Frozen Glass, show a heavy dependency on _PP for G70. In FP32, though, the former shows a 35% advantage for G70 while the latter shows a 79% advantage.

(7800GTX-512 assumed to be 550MHz and R580 assumed to be 650MHz.)

Jawed
 
Ailuros said:
[image attachment: ps.gif]


http://www.beyond3d.com/previews/nvidia/g70/index.php?p=02

You may ask Wavey to send his check back first :p

This is the marketing diagram.

The "technical" version looks different:



More details (but other names for the units):

 
Uttar said:
Now, on the other hand, if you decrease the ALU:TEX ratio, NVIDIA's texturing abilities increase while their arithmetic ones decrease, which gives them an obvious advantage. So below that 3:1, you'd conceptualize each of NVIDIA's pipelines as doing more and more compared to ATI's "pipelines", down to the theoretical extreme of 0:1 (pure texturing), where it'd become a (24/16) performance ratio between NVIDIA and ATI (DX7-era games, and some DX8-era ones).

Won't the limitation of the ROPs kick in at some point in that scenario tho?
 
geo said:
Won't the limitation of the ROPs kick in at some point in that scenario tho?

Only in cases where you have only one texture layer and need only bilinear filtering. But in those cases you will have a bandwidth problem too.
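
Rough numbers on why bandwidth bites first in that single-texture-layer, bilinear-only case; every figure here is my own back-of-envelope assumption (16 ROPs on G70, the 550MHz and ~54.4GB/s of a 7800 GTX 512, 4 bytes each for colour write, Z read, Z write and roughly one cached texel per pixel, no compression):

Code:
rop_pixels_per_clock = 16
core_mhz             = 550
mem_bw_gb_s          = 54.4          # 256-bit GDDR3 at 1.7GHz effective (assumed)

bytes_per_pixel = 4 + 4 + 4 + 4      # colour write + Z read + Z write + ~1 texel
needed_gb_s = rop_pixels_per_clock * core_mhz * 1e6 * bytes_per_pixel / 1e9

print("ROP-limited fill wants ~%.0f GB/s vs ~%.0f GB/s available" % (needed_gb_s, mem_bw_gb_s))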
 
geo said:
Won't the limitation of the ROPs kick in at some point in that scenario tho?
No, not when multitexturing or using better filtering. In that case I'd personally rather fear memory bandwidth, though (unless the textures are sufficiently low-res and compressed, and the resolution is sufficiently high to improve framebuffer compression - both of which are in fact very likely when playing an old game on a new high-end card).

Uttar
EDIT: Damn, Demirug beat me to it ;)
 