Chalnoth said:
demalion said:
The NV40 technical information doesn't indicate "two full (without limits) shading units per pipeline" either, just the marketing material (that said the same thing for the NV35, remember?). Why, in your perception of possibility, is this required for the R420 in order for it to compare favorably?
No, the NV40 doesn't have two full shading units. Apparently the structure is:
SU1: can execute a mul or special function and a 16-bit nrm.
SU2: can execute a mad, mul, or add.
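A rough way to picture the unit layout quoted above is a toy issue model. This is only a sketch: it assumes every op in the list is independent, that "special function" covers ops such as rcp, rsq, sin and cos (an assumption, not something stated in the quote), and it leaves the 16-bit nrm out entirely. The names SU1_OPS, SU2_OPS and min_clocks are made up for illustration.

```python
# Toy model of the two shader units as described in the quote.
# Assumption: the special-function list below is a guess for illustration.
SU1_OPS = {"mul", "rcp", "rsq", "sin", "cos"}
SU2_OPS = {"mad", "mul", "add"}


def min_clocks(ops):
    """Fewest clocks to issue a list of independent ops, one op per unit per clock."""
    su1_only = su2_only = either = 0
    for op in ops:
        in1, in2 = op in SU1_OPS, op in SU2_OPS
        if in1 and in2:
            either += 1      # e.g. mul can go to whichever unit is free
        elif in1:
            su1_only += 1    # special functions must use SU1
        elif in2:
            su2_only += 1    # mad and add must use SU2
        else:
            raise ValueError(f"op not handled by this toy model: {op}")
    total = su1_only + su2_only + either
    # Bounded by the busier restricted unit, or by two ops per clock overall.
    return max(su1_only, su2_only, (total + 1) // 2)
```

Under these assumptions a mul paired with a mad issues in one clock, while two mads need two clocks, since only SU2 can take a mad. Real shaders have dependencies and register limits that this ignores.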
Yeah, congrats, you read it, I guess? Now read about the R3xx. Hold both thoughts in your mind, and then let only the logical things out through the keyboard, if you are able.
Current shader benchmarks put this architecture at ~20% faster per pipeline per clock than the Radeon 9800 XT on average, when operating in full FP32.
What average are you talking about, and do the metrics that go into it isolate pixel shading from bandwidth and vertex processing advantages? Your using the word "average" while picking and choosing info doesn't make that figure any less imaginary.
So what's your excuse for ignoring that this advantage isn't universal and that even the R3xx should be able to reverse this situation in some cases? It seems you don't want to recognize this, because it just might indicate that minor changes could conceivably change the balancing point of the "average" case.
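To make the point about averages concrete, here is a small sketch of how a "per pipeline per clock" figure is normalized and how the mean shifts with the choice of shaders. Every number in it is invented purely for illustration and is not a benchmark result; per_pipe_per_clock and mean_ratio are hypothetical helper names.

```python
def per_pipe_per_clock(rate, pipelines, clock_mhz):
    """Normalize a raw shader rate by pipeline count and core clock."""
    return rate / (pipelines * clock_mhz)


def mean_ratio(ratios, names):
    """Arithmetic mean of the per-shader speed ratios for the chosen shaders."""
    return sum(ratios[n] for n in names) / len(names)


# Invented ratios of chip A to chip B, already normalized per pipeline per clock.
# These are NOT measurements; they only show how selection moves the average.
ratios = {"shader_a": 1.5, "shader_b": 1.2, "shader_c": 0.9}

everything = mean_ratio(ratios, ["shader_a", "shader_b", "shader_c"])
favorable = mean_ratio(ratios, ["shader_a", "shader_b"])
```

With these made-up inputs, dropping the one shader where chip B wins raises the "average" advantage noticeably, which is why the composition of the average matters as much as the headline figure.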
Look, the NV40 architecture is excellent, but your selective perception makes discussion of how this compares to any other IHV a useless exercise while you continue to display your bias blinders at every turn.
This is a shame and unfair to the NV40, because even though the info doesn't support the only possibility you'll accept (that it is impossible for "non-dramatic" changes to the R3xx pipeline, and therefore changes possible for the R420, to achieve more effective general case throughput than NV40), it does, AFAICS, indicate that the NV40 throughput picture is advantageous in comparison to the R3xx for quite a significant body of shaders. This is no small achievement of the NV40 engineers, nor was delivering an option for PS 3.0 features and achieving 16 pipelines at the same time. You polarize people into attacking the NV40 by making a mockery of comparing it to other architectures to fit things into your worldview.
I really think that ATI would have to do something dramatically different to get more than ~20% faster per pipeline per clock with the R420.
Than the R3xx? No, they wouldn't, at least not in the sense of "dramatic" as you use it to propose they can't reasonably achieve it in R420. And you again confuse the idea of efficiency of throughput by treating "20%" as requiring some sort of 20% transistor increase, and ignoring how the NV40's greater efficiency than NV35, with more features, more than double the pipelines, and less than double the transistors, illustrates the flaw with that.
Would only adding more effective dot product op throughput be "dramatically" different? It would double throughput in dot product sequences, and allow operation of a non-dependent op to remove a cycle of cost from normalization. The first case would seem to deliver an advantage over NV40; the second would sometimes significantly reduce the advantage of an NV40 feature that is restricted to partial precision, and put it ahead in PS 2.0 full precision.
How about reducing the effective clock cycle cost of some key operations that might currently be more than one clock cycle? Are there any for the R3xx? Why wouldn't such a tweak be feasible? The engineers have already indicated pure FP32 would have been feasible if they deemed it necessary, and you still preclude that a design could be closely based on the R3xx yet be significantly different in characteristics.
Since the R420 doesn't appear to be a dramatic departure from the R3xx, I doubt that it will have that much greater efficiency, particularly not at the rumored transistor counts.
You make nearly every conversation involving ATI and nVidia a useless exercise. Until, of course, it comes time for nVidia to accomplish something similar, and then "realizing" your prior error.
Congrats.
Let's see if we can shift time/alter the universe to suit you, and allow useful conversation here (if only I had known the NV40 was coming so it could have been this easy when I discussed the NV30's problems with you!): so, when it comes time to refresh the NV40, is nVidia as precluded from "dramatically" changing the NV40 architecture to achieve a higher throughput than the NV40, even within a transistor budget that is unclear in specifics but similarly limited to the NV40's?
Would it then enter your head to consider the significance of their methodology of counting transistors, to think that maybe a significant portion of their transistor budget might conceivably remain untouched by improvements that increase throughput for pixel processing?
Well?
So, if you read the above,...
Umm...remember, it was me that provided a link to my discussing the info about the NV40 you recognized "above"...you were proposing that two fully independent ALU ops per clock would be necessary to show advantage to the NV40 prior to that. You didn't provide any new info about why your R420 versus NV40 dictates make sense.
EDIT: Well, your info doesn't quite match the info I'm thinking of, actually. Where is it from?
...you should notice that when running FP32, I expect the NV40 to roughly achieve parity with the R420 in shader ops.
Because, of course, what ATI can achieve is limited by what nVidia did, never mind any side issues?
But the NV40 has the added advantage of additional FP16 functional units, and so I expect the NV40 to pull ahead when partial precision is used in appropriate places for special functions (i.e. rsq, nrm, which are commonly used in lighting).
Ah, no, it is because what ATI can achieve is limited to having no advantages compared to what nVidia did, but the converse doesn't apply. I see.