Chalnoth said:
demalion said:
Chalnoth said:
Type definition requires programmer input for proper results in all situations (either per-variable or per-instruction).
Why are different range and precision types "required"?
What can you do with lower precision processing that you can't do with higher precision processing? You keep making statements that presume there is something, and you ignore me when I ask you to specify what it is.
I think you misunderstood me.
I'm saying nothing about requiring lower precision processing.
That's because the need for lower precision processing being a given is implicit in your statements, and you simply ignore every comment I make that asks you why it is important.
I'm saying that to properly make use of lower precision processing, you cannot effectively allow the driver to detect when it can lower the precision. You need programmer input.
I follow this...if you use lower precision processing, you need programmer input. But why are you using lower precision processing? Why should the programmer make that input?
Therefore, you need type definitions in the shader language to properly use different levels of precision.
Well, what I asked you was why you feel free to criticize a language for lacking lower precision limitations because the NV3x depends upon them, yet you see no problem in the NV3x's lack of the ability to portion processing between scalar and vec3 operations?
Wouldn't that ability allow the NV3x to avoid "transistor waste" more effectively? Your entire demand for lower precision processing, and the reason you assume for it, are built on a dependence on transistor savings in hardware. You ignore the R3xx compared to the NV3x, and apparently even the latest incarnation of the NV3x as well, and propose the transistor savings as a self-evident fact independent of details such as the transistor count of actual hardware, or the fact that low precision processing units sit idle when you aren't processing at lower precision.
You continue to blame anything and everything but the hardware that depends on lower precision, as if ignoring the issues demonstrated with supporting lowered precision causes them to disappear. Could you actually respond to a question or assertion without simple denial and restatement of your "articles of faith"?
Well, to be consistent with your own stipulations, aren't you asking for different instruction types for different precision processing in the DX9 LLSL or ARB_fragment_program? Was the easy parallel to per instruction data masking in low level shader code too inconvenient?
I'm really not sure what you're trying to say here. This is what I think would be the easiest:
Force the programmer to declare temporaries to be used in the shader.
At declaration of the temporaries to be used, assign a specific data type to that temporary. Implementations must use a data type that provides at least the accuracy of the data type chosen. (See the sketch below.)
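For concreteness, a minimal sketch of that scheme, written in C with hypothetical type names (a real shader language would define its own):

```c
/* A hypothetical per-temporary precision scheme, sketched in C.
 * An implementation must supply at least the declared accuracy,
 * so it is free to promote: here the "fp16" temporary is simply
 * carried in a 32-bit float on a full-precision target. */
typedef float fp32;   /* at least 32-bit floating point */
typedef float fp16;   /* at least 16-bit; promoted to float here */

fp32 coord_temp;      /* texture-coordinate math: full precision */
fp16 color_temp;      /* color math headed for 8-bit output */
```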
Skipped over quite a few inconvenient questions, didn't you? Why are you avoiding challenges to your stated viewpoint?
What you're describing is how scalars and vec3 types are declared as well.
The difference between dependence on vec3/scalar and dependence on varying range and precision isn't, as you stated, that one is subject to programmer input and the other is not, because at the same level of abstraction both can be subject to programmer input: there are vec4, vec3, and scalar datatypes at the higher level of abstraction, and the programmer decides on that declaration.
The difference between them is that one is easier to optimize for on the fly. A scalar/vec3 coissue is made visible by processing a scalar and vec3 datatype, or analyzing component usage...you're not recognizing the first.
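That component-usage analysis can be sketched in a few lines of C; the mask encoding and function name are hypothetical, but the test is just this:

```c
#include <stdbool.h>

/* Bit per register component: x, y, z, w. Encoding is illustrative. */
enum { X = 1, Y = 2, Z = 4, W = 8 };

/* Two instructions writing the same register can be co-issued as a
 * vec3+scalar pair when their write masks don't overlap -- e.g. one
 * writes .xyz while the other writes .w. */
static bool can_coissue(unsigned mask_a, unsigned mask_b)
{
    return (mask_a & mask_b) == 0;
}

/* can_coissue(X | Y | Z, W) -> true: a MUL on .xyz pairs with a
 * scalar op on .w in the same cycle. */
```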
This does not prevent them from being compared. When you do compare, the choice of one and not the other as a dependency for transistor savings and full performance can be evaluated.
When these apparently artificial disqualifications for comparison are removed, what we're left with is that one type of transistor savings, the one you're focusing on to the exclusion of all else, requires range and precision limitation and programmer planning because it can't be done effectively in a simple fashion, while the other, by your own statements as well, does not require that level of specific programmer planning and input, though programmer input can still be utilized to expose it.
You are proposing that when the first cannot be done effectively on the fly, this is not a deficiency in depending on that mechanism for transistor savings. That's where the desirability of lower precision processing is implicitly assumed, without recognition of reality or counterargument.
There is more than one way to optimize hardware for varying datatypes...range and precision is not the only way. Range and precision requiring unique and extra planning is a negative, and grants no new abilities except the ability to spend time planning around the limitations of hardware that depends on them. You consistently avoid specifying the advantages they offer, and just keep proposing that some advantage being gained is a given.
When lower precision can't be used, lower precision processing units are precluded. When a scalar op and a vec3-or-less op can't be paired for execution, no new waste is incurred...at worst, the same waste that would occur with vec4-only units will manifest.
You still haven't provided the argument as to why lower precisions are worthwhile, you're just repeating yourself. Well, so far it seems to boil down to "the NV3x needs them, and stating otherwise would mean that dependence is a flaw in its design".
The output is still 8-bit. If the majority of processing isn't going to require much higher than 8-bit precision, why not accelerate lower precision than 24-bit?
Because all replications of those units sit idle when 24-bit is needed, and the more parallelism, the worse that is; because the nearer to final output precision processing is, the less processing you can do without manifesting errors; because if you accelerate 24-bit processing itself, you can still use it to output lower precision when needed, and the relationship does not work in the other direction; because there are other opportunities for transistor savings that don't require more potential for idle units, but work to reduce it.
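The error point is easy to demonstrate. A minimal sketch in C: the scale-down/scale-up pair is arbitrary, standing in for any shader math that passes through small intermediate values before reaching the 8-bit output.

```c
#include <stdio.h>
#include <math.h>

/* Snap a value to 8-bit output steps (1/255). */
static double quant8(double v)
{
    return floor(v * 255.0 + 0.5) / 255.0;
}

int main(void)
{
    double in = 131.0 / 255.0;                      /* 8-bit-exact input */

    double full = quant8(in * 0.05 * 20.0);         /* full-precision intermediates */
    double low  = quant8(quant8(in * 0.05) * 20.0); /* intermediates at output precision */

    printf("full-precision path: %f\n", full);      /* 0.513725 (131/255) */
    printf("8-bit intermediates: %f\n", low);       /* 0.549020 (140/255): 9 steps off */
    return 0;
}
```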
Maybe if you ignore that the R3xx made one decision and the NV3x made another. Well, I'm understating things...you have to ignore the entire idea of hardware design limitations and tradeoffs, but you seem to be in good practice.
I think you are misunderstanding. Support of FP32 + lower types vs. FP24 is, as I see it, orthogonal to the support for coissue of vec and scalar ops. That is, I don't see that as the tradeoff here.
Read the bold part again, and note that the tradeoff has already been made in reality. What are you looking at when you "see" and "don't see" the things you mention?
The tradeoff that I see is that nVidia chose to allow developers to select lower precisions for higher possible performance than would be attainable otherwise, at the expense of performance in shaders that require all high-precision calculations.
Then the tradeoff you are seeing is completely fictitious...I even put the fiction in bold for you. The peak throughput for the NV3x is 12, the peak throughput for the R3xx is 16. For the R3xx, texture ops don't reduce that peak, and for the NV3x, they do.
But the coissue of vector and scalar ops is separate.
This stipulation appears to be completely fictitious as well...scalar and vec3 specification occurs at the same declaration instance you are proposing for precision declaration.
There's no reason that an architecture can't support FP32/FP16/FX12 as well as coissue.
Except that the architecture in question doesn't happen to support coissue. I asked why you refused to recognize that as a flaw associated with the NV3x's unsuitability, not for you to demonstrate the practice again.
There's no reason an architecture can't support FP24 with no coissue.
Could we stop with the fiction? You continue to maintain that all cross vendor shader specifications are against the NV3x, and allied with the R3xx. Now you're doing it by re-ordering the universe to "fp32/fp16/fx12 and coissue" and "fp24 without coissue" to avoid recognizing that "fp24 with coissue" and "fp32/fp16/fx12 without coissue" is related to why the R3xx succeeds with cross vendor shader specifications and the NV3x stumbles.
What I'm asking you is why they're better than fp24. You could also construct simple arguments that explain why FX12 and FP16 are NOT enough, but then you'd have to tackle the fact that in THOSE circumstances, dependency on lower precision for processing is purely a weakness.
It's not purely a weakness because the NV3x can execute more vector arithmetic operations per clock than the R3xx.
Added some bolding at the end for reading assistance.
Heh, "more vector arithmetic operations per clock". Again, operations on scalars seem to have disappeared. But you "don't know" about scalar ops in shaders, so that's OK.
But where'd the texture ops go?
and others have sufficiently described where FP32 may be beneficial (texture ops).
Hmm? The recent discussion I recall was about how fp24 is beneficial, and sufficient. Could you point out what you are thinking of?
I was just speaking in terms of FP32 vs. FP16. FP24 should be enough as well.
I was confused when you didn't say "FP24".
nVidia has stated that 32-bit FP really is necessary for proper operation, but I am unsure how much relevance this has. I'm sure FP32 will show benefits in some situations over FP24. I'm just unsure as to where exactly (except the obvious case of very long shaders).
How does your "not much more than 8-bit precision" fit into long shaders, by the way? Isn't it that both full speed processing and greater precision matter more for greater shader lengths?
Seems to be more of "recognizing benefits of fp32 and ignoring drawbacks of fp16 and fx12", in any case.
I'm assuming this is an editing mistake, and I'm not just missing a subtle rewording of my statement. Or were you just saving me the trouble of saying this again?
I can't construct a simple argument on how coissue of vector and scalar ops is good.
Err...when you have a vec3 and a scalar to coissue, and have some reason to want your shader to execute more quickly?
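To illustrate, a minimal sketch in plain C (a struct standing in for a register): ordinary lighting math naturally yields a vec3 op and a scalar op with no dependence between them.

```c
typedef struct { float x, y, z; } vec3;

/* In a typical lighting step, the 3-component color math and the
 * scalar attenuation math are independent, so hardware with coissue
 * can retire one of each per cycle instead of serializing them. */
vec3 lit_color(vec3 albedo, vec3 light, float n_dot_l, float dist2)
{
    /* These two statements don't depend on each other: */
    vec3  c = { albedo.x * light.x,       /* vec3 multiply      */
                albedo.y * light.y,
                albedo.z * light.z };
    float a = n_dot_l / dist2;            /* scalar attenuation */

    vec3 out = { c.x * a, c.y * a, c.z * a };
    return out;
}
```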
What I meant was: how often will scalar ops actually be used in 3D graphics rendering?
You continue to propose you have that answer for FX12 usage, but your substantiation seems a bit thin.
Do you know of a shader that a game developer might realistically use that would be half scalar ops and half vec3 ops?
That's only required for the R3xx to lead outside of texture ops and register usage beyond the NV3x limitations. Do you know of a shader benchmark where the NV3x leads even as much as its clock advantage? You do seem to be trying to ignore discussion of Dawn.
What about texture ops and register utilization?
And you have more information on how often FX12 precision and range are used in pixel shaders? Is that a product of thought and investigation, or selective vision? If it is the product of thought and investigation, please share so we'll have something useful to discuss?
Again: output is 8-bit integer. Here are two simple algorithms that will work great with FX12 (sketched in code below):
1. Weighted averaging of four values (used in bilinear filtering)
2. Multiply add: x + a*y (where x and y are color vectors, and a is a value between 0 and 1, as used in alpha blending)
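For concreteness, a minimal sketch of those two in fixed point, assuming the commonly described FX12 layout (signed, ten fractional bits, range about [-2, +2)); the helper names are hypothetical:

```c
#include <stdint.h>

typedef int16_t fx12;              /* value = raw / 1024.0 */
#define FX12_ONE 1024

/* Fixed-point multiply: the raw product has 20 fractional bits,
 * so shift back down to 10. No clamping shown; results are
 * assumed to stay inside the representable range. */
static fx12 fx12_mul(fx12 a, fx12 b)
{
    return (fx12)(((int32_t)a * b) >> 10);
}

/* 1. Weighted average of four values (bilinear-filter style);
 *    the weights are expected to sum to FX12_ONE. */
static fx12 weighted_avg4(const fx12 v[4], const fx12 w[4])
{
    int32_t acc = 0;
    for (int i = 0; i < 4; i++)
        acc += (int32_t)v[i] * w[i];
    return (fx12)(acc >> 10);
}

/* 2. Multiply-add x + a*y, with a in [0, FX12_ONE] (alpha-blend style). */
static fx12 mad(fx12 x, fx12 a, fx12 y)
{
    return (fx12)(x + fx12_mul(a, y));
}
```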
What happens when you use these values for further calculations? Or are we talking about simple shaders again?
These are common in basic 3D rendering. It just makes sense that many shaders will want to use these or similar instructions. For example, the blending may be useful for pretty much any shader that makes use of several components: diffuse, specular, gloss, etc.
Yes, but if that's all you wanted to do, you could have used DX 8 hardware. Again, I point out that the R3xx seems to keep up well with the NV3x in executing Dawn shaders. How does that happen with the NV3x having a clock speed and throughput advantage while running code hand tuned for it, and the R3xx being tied down with fewer transistors in the first place, the silly scalar/vec3 thing, and "wasting" transistors on higher precision processing?
Could your analysis be flawed in some way?
Hmm...well, maybe it's a fluke, as well as all the other shader benchmarks, but could you help with pointing me in the direction of the info that supports that belief? Without involving clipping planes again.
Ah, so that transistor waste is bad. What does that have to do with GPU transistor usage, though?
I've already mentioned your fallacy when you tried to say that ARB_fragment_program and DX 9 LLSL are the R3xx low level language...GPUs don't have hardware to implement shader "assembly" specifications.
But it's still lower level than an HLSL. It's still more limiting to the hardware than an HLSL would be.
I asked you to tell me why the NV3x deficiencies wouldn't manifest in HLSL, not to just state that they wouldn't.
Low level language has drawbacks and strengths...without faulty comparisons, please show how the drawbacks shown by the NV3x in various LLSLs wouldn't be shown in a shader HLSL. The fact of the matter is that textures, registers, and the same operations are still there, because they are the purpose of the shader in the HLSL or the LLSL.
The NV3x is quirky in more ways than just data types. There's also the limited register usage, for one. An HLSL could conceivably do a better job at figuring out how to assign register usage than could an assembly to machine compiler.
Ayep, and the R3xx doesn't have that limitation. The LLSL exposing that isn't because LLSL is bad, it is because the NV3x is bad at LLSL. The headaches for working around that are NV3x specific. The headaches don't disappear when using the HLSL, it just does more work trying to avoid those headaches for the programmer.
These headaches are the only tangible manifestation of the benefits of lower precision processing that have been demonstrated.
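To put the register point above in concrete terms, a minimal sketch in plain C standing in for shader code: the same math holding four temporaries live at once, and rewritten to hold one accumulator, which is the kind of transformation a compiler targeting a register-starved part could apply, at the cost of serializing the adds.

```c
/* Four temporaries live at once: more registers, more parallelism. */
float sum_of_squares_wide(float a, float b, float c, float d)
{
    float t0 = a * a;
    float t1 = b * b;
    float t2 = c * c;
    float t3 = d * d;
    return (t0 + t1) + (t2 + t3);
}

/* One accumulator live at a time: fewer registers, serial adds. */
float sum_of_squares_narrow(float a, float b, float c, float d)
{
    float acc = a * a;
    acc += b * b;
    acc += c * c;
    acc += d * d;
    return acc;
}
```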
Your calling it assembly and drawing the parallel to x86 hardware instruction implementation and "transistor wasting" is the only basis you present for low level specifications being bad.
Wasting transistors in x86 processors isn't the only thing that is bad. You should know: many other processors have much higher performance than the x86 architecture. x86 only keeps up through having more money dedicated to the market.
And what does that have to do with the flaws in your analogy?