trinibwoy said:
What you've outlined above is exactly why per-shader performance is a much better metric than per-ALU performance: the numbers aren't obscured by intricate architectural details.
I believe I was originally looking more at "per-shader" performance earlier rather than per ALU, but people seemed to want to look at per-ALU for some reason.
trinibwoy said:
What you've still not addressed is your justification for considering R580 a 16-shader part, while at the same time considering G70 a 24-shader part.
Funny - I thought I had addressed that, but I'll try again.
We want to compare the respective architectures, and for the purposes of comparison we want to divide them up into chunks called 'shaders', each of which can execute a pixel shader program.
What do you need to have in order to run a basic pixel shader program?
- To be able to run a generic pixel shader program you require ALU and texture resources (and potentially flow control etc., but let's keep things relatively simple)
So a 'unit' that can run a pixel shader program needs both ALU and texture resources; I therefore divide each architecture evenly into chunks that meet those criteria and call them 'shaders'. This results in the following divisions -
R580 - 16 "shaders", each with 1 texture resource and 3 ALU resources
G70 - 24 "shaders", each with 1 texture/ALU resource and 1 dedicated ALU resource.
Dividing things in this way doesn't necessarily have anything to do with the underlying architecture - it's just a way to form a basis for comparison. I guess you could actually pick _any_ basis, as long as your assumptions are consistent and correct, and perform a comparison.
For example, if you want to look at the ALU-only case, we could choose to discount the texture resources entirely and say that G70 has 48 ALUs and R580 has 48 ALUs, which is what I did in the earlier post about the Cook-Torrance shader performance.
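To make the arithmetic behind those divisions explicit, here is a small sketch. The per-shader figures are the ones quoted in this post; the grouping itself is just a basis for comparison, not a claim about how the silicon is actually organised:

```python
# Illustrative only: group each architecture into 'shaders' as described
# above, then derive the totals. Per-shader figures are from this post,
# not an authoritative hardware description. (For G70, one of the two
# ALUs per shader is the shared texture/ALU resource.)
architectures = {
    "R580": {"shaders": 16, "tex_per_shader": 1, "alus_per_shader": 3},
    "G70":  {"shaders": 24, "tex_per_shader": 1, "alus_per_shader": 2},
}

for name, spec in architectures.items():
    total_alus = spec["shaders"] * spec["alus_per_shader"]
    total_tex = spec["shaders"] * spec["tex_per_shader"]
    print(f"{name}: {spec['shaders']} shaders, "
          f"{total_alus} ALUs, {total_tex} texture units")
```

Both groupings come out at 48 ALUs total, which is why the ALU-only comparison below treats the two parts as 48 vs 48.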
The problem then obviously comes back to the more thorny area of the debate which is "what is considered an ALU?". A lot of people seem to feel that it's an individual block that contains a MAD unit, which was how I framed it, but as you correctly point out this is not the only possibility by any means.
trinibwoy said:
And if you want to get into the discounting game, considering R580 a 48 ALU part still discounts the ADD of the first ALU
I wasn't playing a discounting game - I was playing a simplifying game.
When I initially started comparing things I was content to work at a more abstract level - take each ALU chunk from each architecture as a black box and simply compare the apparent execution characteristics on the supplied shaders. I was not really looking at why A was faster or slower than B, or whether A is more expensive in silicon than B, though those are also interesting questions to answer.
Mainly the interest seemed to be in comparing the performance characteristics of the ALUs of the two designs, which is what I did. Examining the exact tradeoffs that went into the different designs and their detailed behaviour would be far more complex. Both architectures have "ALUs" that are more complex and perform more operations than a simple MAD.
trinibwoy said:
I think that's a very generous trade for Nvidia's mini-ALU.
Whether it's a generous trade or not depends on the full capabilities of that mini-ALU (and of things like the NRM unit, of course). We could equally say that allowing particular ALUs to run at 16-bit precision while comparing against 32-bit precision is a generous trade in the opposite direction, and that for a like-for-like comparison we should stick to 32-bit precision only (which is fine by me, by the way...).
I agree that it's generally very difficult to do this analysis in a 'fair' manner, and I certainly see your point about the additional capabilities of the R5xx ALUs. But I just don't see how we can count R580 as having an extra ADD without also counting an extra NRM, for instance. If you want to say that G70 has only 24 ALUs then I guess you can do that, but then you are just reversing the problem, because you are then saying it's fair to equate something like -
(MAD + miniALU_X + NRM + MAD + miniALU_Y)
as a functional unit to:
(miniALU_Z + MAD)
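To see the asymmetry in that framing concretely, here is a toy tally. The sub-unit names (miniALU_X and so on) are the placeholder labels from the comparison above, not real hardware identifiers:

```python
# Toy illustration: under the '24-ALU G70' framing, one G70 'ALU'
# bundles all of these sub-units (placeholder names from the post)...
g70_as_one_alu = ["MAD", "miniALU_X", "NRM", "MAD", "miniALU_Y"]

# ...while one R580 'ALU' bundles only these:
r580_one_alu = ["miniALU_Z", "MAD"]

# That framing equates a five-sub-unit bundle with a two-sub-unit one.
print(len(g70_as_one_alu), "sub-units equated with", len(r580_one_alu))
```

The point isn't that one tally is right and the other wrong, only that any per-"ALU" comparison bakes in a choice about how much hardware an "ALU" contains.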
I guess the one thing that we can say for sure is that, whichever way we choose to frame this, someone is going to feel hard done by.