Chalnoth, two things:
- This question:
Chalnoth said:
Ostsol said:
Here's a thought: do you think the GeforceFX's floating point capabilities would be so limited as they are if NVidia decided not to include support for FX12?
At the same time, do you think the FX would have the same peak shader processing power if not for the support of FX12?
is premised on the NV30's FX12 decision not being a mistake, but rather being necessary for performance.
And it directly contradicts this assertion:
Chalnoth said:
Well, now that the NV35 architecture apparently shows very little performance drop when changing all FX12 ops to FP16 ops, it is conceivable that those FX12 units were updated to FP16 units.
...
It seems simple: FP16 plus FP32 was a good decision; FX12 plus FP16 plus FP32 was not. To me, it is illogical to simultaneously propose that "FX12 was necessary and not wasteful" and that "the NV35 is able to improve a similar design significantly", and doing so isn't even consistent with what nVidia themselves have recognized.
I'm not sure why persisting in defending FX12 made much sense before, when a greater transistor count, a higher clock speed, and many shader limitations were necessary to show any advantage over the R300, and I don't understand at all how you propose it makes any sense now that a viable alternative in the same family, within a similar transistor budget, is apparently ready to be delivered: the NV35. The NV35 simply seems to remove the question of whether the NV30 was a wasteful design, unless there are hidden drawbacks (which doesn't seem too likely).
- Another comment:
Chalnoth said:
Well, now that the NV35 architecture apparently shows very little performance drop when changing all FX12 ops to FP16 ops, it is conceivable that those FX12 units were updated to FP16 units.
Why do you persist in concentrating on the peak performance of the NV3x (the NV35 in this case), while ignoring the limitations affecting its ability to reach that peak, ignoring the peak performance of the R3xx (which, btw, is 16 ops if you want to ignore limitations), and then concluding that "shader performance will still be higher than an 8 PS per clock architecture"?
How common are non-dependent vec3/scalar op pairings?
How often does it need to access textures?
How important is granularity of optimization opportunity?
What role does parallelism play in accomplishing the workload?
How significant a role do instruction execution performance differences play in performance for a particular shader?
All these questions are very important and directly relevant to the comparison, and they are questions you consistently ignore when you state "12 versus 8" in what seems to me to be a useless fashion...you're quoting one architecture's maximum against the other's minimum, and surely you must realize that such a comparison is at least slightly biased?!
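To put some numbers on what I mean, here's the sort of back-of-envelope model I have in mind. Every figure in it is a made-up assumption, not a measurement of the NV35 or the R3xx; the only point is how far the effective rate moves away from the peak once the instruction mix, co-issue opportunity, and texture stalls enter the picture.

```python
# Purely illustrative: why a "peak ops per clock" number says little by itself.
# All parameters below are hypothetical assumptions, not NV35/R3xx measurements.

def effective_ops_per_clock(pipes, coissue_fraction,
                            texture_fraction, texture_stall_cycles):
    """pipes: pipelines that can each issue a vec3+scalar pair per clock.
    coissue_fraction: fraction of instructions that find an independent
                      scalar partner to pair with.
    texture_fraction: fraction of instructions that are texture fetches.
    texture_stall_cycles: extra cycles lost per fetch when latency isn't hidden."""
    issued = pipes * (1 + coissue_fraction)          # ops issued per clock, pre-stall
    stall_penalty = 1 + texture_fraction * texture_stall_cycles
    return issued / stall_penalty

print(effective_ops_per_clock(8, 1.0, 0.0, 0))   # 16.0 -> the "ignore limitations" peak
print(effective_ops_per_clock(8, 0.3, 0.2, 1))   # ~8.7 -> a more mixed, stall-prone workload
```

The specific outputs don't matter; what matters is that the answer swings by a factor of two depending on the answers to the questions above, which is exactly why quoting one peak against another peak (or minimum) tells us nothing.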
This is before we get into discussing things like:
- The NV30 introduces an additional dependency on using FX12 to reach its peak...why is FX12 good at all? Surely you can at least recognize that the NV35's ability to do better within a similar transistor budget is a much better design, so why defend FX12 in a similar design with a similar transistor count?
- The impact on performance of exceeding register usage restrictions...perhaps likely to have at least some minor impact on long shaders?
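To make that bullet concrete, here is a trivial sketch of how such a penalty might scale with shader length. The register limit and penalty factor are purely hypothetical placeholders, not measured NV3x figures.

```python
# Hypothetical numbers only: a toy model of a register-usage restriction.
# "penalty_factor" is an assumed throughput multiplier applied once a shader
# needs more live temporaries than the restriction allows.

def shader_cycles(instructions, live_registers, register_limit, penalty_factor):
    if live_registers <= register_limit:
        return instructions
    # Longer shaders tend to need more live temporaries, so the whole
    # shader runs at the reduced rate once the limit is exceeded.
    return instructions * penalty_factor

print(shader_cycles(20, 2, 2, 2.0))   # 20   -> short shader, within the restriction
print(shader_cycles(80, 6, 2, 2.0))   # 160  -> long shader, over the limit
```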
...
For some other (not self-contradicting) statements:
DemoCoder,
Why do we need integer data types for looping, when integer processing is a subset of floating-point processing? I'm assuming the answer is efficiency of hardware utilization for indexing such registers. If I have that right, why are we looking at integer processing as the only way to handle this looping efficiently? What about the idea of using a separate scalar operation, instead of tying up a full 4-component unit for the same loop-increment operation? Isn't that even more efficient than a separate 4-component integer unit that can only apply the same op to all components? This example concerns existing designs: the R3xx versus the NV30-NV34.
Further, if this is indeed efficient, what sense does it make to add an extra integer processing unit for looping, when the same unit could be used for other scalar opportunities, including floating point, simply by allowing it to handle multiple data types on input and output instead of being restricted? This example concerns the actual R3xx versus a hypothetical R3xx with integer-only units, or a hypothetical NV35-alike with the ability to split off separate scalar/vec3 ops (assuming the architecture doesn't make that too difficult).
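Here's a toy slot-counting sketch of what I mean. The two unit layouts are hypothetical stand-ins ("vec3+scalar co-issue pipe" versus "vec4 FP unit plus a dedicated integer unit"), not descriptions of actual R3xx or NV3x internals, and the loop sizes are arbitrary.

```python
# Illustrative only: the loop-counter argument as a slot-counting exercise.
# Unit layouts and workload sizes are hypothetical assumptions.

def loop_cost_coissue(iterations, vec3_ops_per_iter):
    # The counter increment rides in the scalar half of one vec3 issue per
    # iteration, so it consumes no extra issue slots and no extra unit.
    return iterations * vec3_ops_per_iter

def loop_cost_dedicated_int(iterations, vec3_ops_per_iter):
    # The FP work takes the same number of cycles; the separate integer unit
    # performs one increment per iteration and is otherwise idle.
    fp_cycles = iterations * vec3_ops_per_iter
    int_unit_utilization = iterations / fp_cycles
    return fp_cycles, int_unit_utilization

print(loop_cost_coissue(64, 5))          # 320 cycles, no extra unit needed
print(loop_cost_dedicated_int(64, 5))    # (320, 0.2): same cycles, int unit 80% idle
```

Same cycle count either way; the dedicated integer unit just spends transistors to sit mostly idle, while the flexible scalar slot could have been doing FP scalar work whenever it isn't incrementing a counter.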
Is it just a matter of not considering that the R3xx limitation is only in processing, not input and output, and that its FP processing is "wasted" for only one clock cycle, costing at worst an FP24 vec3 slot for that pipe and clock? I.e., viewing the units as "FP only" when they could also be viewed as flexible FP/INT/vec3/scalar units.
Isn't the waste over the entire scene what is significant? Having units that are "int only" and "vec4 only" seems to me more likely to be wasteful, and focusing only on the "waste" of "FP only" units seems to ignore that, AFAICS.
...
My view on the NV3x is that it is a "bust" for PS 2.0 in comparison to the R3xx, pure and simple. I do see the possibility of a good advantage from data-creation operations like SINCOS helping to reach peak throughput, with the NV35 specifically, for long shaders that try to minimize texture fetches (in contrast to fixed-point dependency, which to me seems more likely to lean on a greater proportion of texture usage for effects...the specular normalization in Doom 3 being an example of this, I think), but that's the only advantage, and it depends on ignoring significant disadvantages as well. Maybe if the NV35 has vec3/scalar optimization opportunity, hidden thus far for some reason, it could excel.
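For what I mean by the math-versus-fetch trade-off, a quick toy illustration: normalizing an interpolated vector either with arithmetic (the kind of work strong FP units favor) or with a coarse precomputed lookup standing in for a normalization cubemap. The lookup resolution and the test vector are arbitrary, and this is not how Doom 3 actually builds its cubemap.

```python
# Toy illustration of computing a value versus fetching an approximation of it.
# All values are arbitrary; the "lookup" is just a quantized stand-in for a
# normalization-cubemap texel, not a real cubemap implementation.
import numpy as np

v = np.array([0.3, -0.7, 0.55])            # some interpolated, unnormalized vector

# Arithmetic path: a reciprocal square root plus a multiply (shader math).
normalized_math = v / np.sqrt(np.dot(v, v))

# Lookup path: quantize the direction coarsely, then use the precomputed
# unit vector for that cell (stand-in for a texture fetch).
grid = 32                                   # hypothetical lookup resolution
quantized = np.round(v / np.abs(v).max() * (grid / 2)) / (grid / 2)
normalized_fetch = quantized / np.sqrt(np.dot(quantized, quantized))

print(normalized_math)
print(normalized_fetch)                     # close, but quantized: the fetch path
                                            # trades math for bandwidth and some error
```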
What I think is that the R3xx is, relatively speaking, a poor design approach to build upon for the full PS 3.0/VS 3.0 spec, and that the NV3x approach is a better one. But the NV3x doesn't do the full PS 3.0/VS 3.0 spec, and is, relatively speaking, a poor design for what it does do. I think nVidia got caught flat-footed by taking their own time in transitioning to the goal of PS/VS 3.0, and that their adherence to incremental improvements, with performance gained through process implementation, versus ATI's commitment to an apparently more extensive reinvention of their design, with process dependency being somewhat more secondary, bit nVidia in the rear.
I think NV40 versus the "R390" (which I think is likely to be the more proper name) could easily be PS 3.0/VS 3.0 versus speedy PS 2.0/VS 2.0, with a lot more nice things to be said about nVidia's part compared to ATI's at that time.
But we're not at that time yet, and the comparative praise that I'm seeing either seems nonsensical to me, Chalnoth, or is something coherent that still seems flawed from what I can see, Demo.