Rendering paths in Doom3 revisited..

Entropy said:
On the other hand, he has given no indication that I'm aware of that he would recode the Doom3 engine using a HLSL.

He wouldn't have to recode the whole engine, just write another rendering backend. It may not necessarily be quick and easy, but it doesn't sound like an extremely huge task either given that he already has plenty of backends done. I figure he could more or less reuse most of his ARB2 code but change the shader upload code, the setting of constants, etc. There's not much difference between coding for an HLSL and coding for an asm language.
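As a concrete illustration of how small that backend difference is, here is a minimal sketch in C of the two upload paths, using the standard ARB_fragment_program entry points and the GL 2.0-style GLSL ones; the trivial shader strings and function names are placeholders of mine, not anything from Doom3:

#define GL_GLEXT_PROTOTYPES
#include <string.h>
#include <GL/gl.h>
#include <GL/glext.h>

/* ARB2-style path: hand the driver an assembly string, set a "constant"
   through a program environment parameter. */
static void upload_asm_backend(const float c0[4])
{
    const char *src =
        "!!ARBfp1.0\n"
        "MOV result.color, program.env[0];\n"
        "END\n";
    GLuint prog;
    glGenProgramsARB(1, &prog);
    glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, prog);
    glProgramStringARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                       (GLsizei)strlen(src), src);
    glProgramEnvParameter4fvARB(GL_FRAGMENT_PROGRAM_ARB, 0, c0);
}

/* HLSL (GLSL) path: compile and link instead of uploading asm; the
   constants become uniforms. */
static void upload_glsl_backend(const float c0[4])
{
    const char *src = "uniform vec4 c0; void main() { gl_FragColor = c0; }";
    GLuint sh = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(sh, 1, &src, NULL);
    glCompileShader(sh);
    GLuint prog = glCreateProgram();
    glAttachShader(prog, sh);
    glLinkProgram(prog);
    glUseProgram(prog);
    glUniform4fv(glGetUniformLocation(prog, "c0"), 1, c0);
}

Everything else in the renderer (geometry submission, passes, state) can stay as it is, which is the point being made.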
 
Chalnoth said:
Entropy said:
and I have seen no indication that Carmack has tailored it to the R3xx family specifically. It would be foolish, and by his own words, unnecessary.

Entropy
Which is the point. The R3xx architecture is so closely-tied to the specs (it appears that ATI got pretty much exactly what they wanted for the standard assembly shader language specification in both DX9 and OpenGL) that there won't be much difference in optimizing for the R3xx in ARB2 and optimizing for any other video card in ARB2.

Pixel shader assembly language specs have some basic characteristics:

  • Have instructions
  • Use registers
  • Use textures
  • Use values
      • Commonly vary distinctly in the number of components used for instructions
      • Require widely varying ranges and precisions depending on what is being done

The R3xx excels in the first three: a common-case granularity of one and high parallelism for dispatching the first; it can perform at full speed within its stated register limitations; and it can execute texture ops without sacrificing the prior capabilities.
In the last, it processes everything at the same range/precision level, but can also take advantage of component usage to double throughput in the case of scalar and vec3-or-less parallelism. It is also not precluded from inputting and outputting varying precision values.

The most prominent problem: when processing components that don't require full precision, transistors are being "wasted".

The NV3x: has a "chunkier" granularity and reduced "parallelism" for execution of the first; is severely limited in usage of the second in order to maintain performance in executing the first; and has to give up throughput in the first when doing the third.
In the last, it might be able to take advantage of component usage to prevent a reduction of throughput from mapping to its unit complement, but not to increase it.

There are many prominent problems, but the one that parallels that of the R3xx: if it has units that DEPEND on lower precision processing, they are completely wasted when that precision is insufficient... at best, they just interfere with finding optimization opportunities. Also, the NV3x seems to always waste transistors when a unique scalar op is being processed. Finally, it is "wasting" transistors as well when the full processing precision is not required.

Just maybe all of these problems are related? You recognize the possible virtues of fp32 compared to fp24, but you then ignore the clearly established limitations of fp16 and fx12 in comparison to fp24, and you further ignore that managing the switching between them just might be related to the problems the NV3x exhibits.

...

The R3xx approach tries to guarantee some utilization of processing units, and seeks opportunities to increase throughput from its baseline.

The NV3x approach, or any with performance dependent on reduced precision processing or value storage (it does both), works to avoid performance degradation, and the shader has to take on specific limitations to do so.

It doesn't require that the spec was suited specifically to ATI for it to display these characteristics. That commentary seems nonsensical in light of these observations... it just seems to substitute an "IHV name" in place of discussing objective merit, so as to avoid recognizing issues independently of how the IHVs executed.
Limited register usage, otherwise ignoring vec3/scalar opportunities, and reducing performance when using textures are pure limitations...why should they be part of the spec? It is excelling at those non-IHV related things that allows the R3xx to excel. It is being deficient at those non-IHV related things that causes the NV3x to have problems.

As for the virtue of supporting lower precisions: why is it better than supporting some variation of operations per component? Whether it is or isn't, why do you ignore that so consistently when we have no evidence of reduced precision processing units offering transistor savings when coexisting with higher precision processing units? And when we can observe independently of examples that such units represent waste when not doing low precision processing?

By all accounts, what he calls the ARB2 path will by default be used by all upcoming cards that support it. I wouldn't be surprised at all if all nVidia cards from the NV40 and onwards used it, for instance.
ARB2 still has one glaring flaw: no instruction-level specification for data types. I see no reason why future hardware should only support one data type.

You persist in repeating that, despite argument to the contrary and the apparent examples of multiple datatype architectures and their failures to offer advantage. Why isn't the lack of an ability to double throughput for scalar and vec3/scalar parallel execution a "glaring flaw" in the NV3x?

And, very soon, I expect JC to move the standard to the GL2 HLSL (GLSlang, I believe it is). ARB2 is just a stopgap: video card architectures just should not be tied to a specific assembly language specification.

The R3xx isn't "tied" to them, it excels at them. It's readily apparent that compared to it, the NV3x does not. Does that make low level languages "bad", or does it make the NV3x "bad"? You consistently propose the first because you refuse to accept the second, and you also continually fail to validate that refusal by any method other than ignoring inconvenient observations.

Question: do the NV3x weaknesses disappear when using a HLSL? It seems to me that all the HLSLs are is greater abstraction that allows more optimization opportunities. The NV3x would still be using those opportunities to try and avoid slowing down, and the R3xx would still be using them to speed up from its baseline.

...

Repetition is a bit tiresome Chalnoth. :-?

However, I would be interested in game developer commentary on this. Since transistor savings and higher replication or clock speed would have to manifest benefit for lower precision processing to show advantage (i.e., low precision in hardware isn't "inherently" good, which it seems you continue to propose), I'd expect the only argument on precision they'd offer that would support the NV3x would be concerning fp32, unless they have some insight into hardware design contrary to the indications so far. Perhaps there will be a surprise worth discussing in this regard.

If we're just talking about supporting it in usage, though, I'll point out again that 24-bit processing doesn't prevent inputting and outputting integer values. A 24-bit floating point datatype doesn't seem to have to do that either.
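A quick worked check of that claim, assuming only that the format carries at least 8 mantissa bits (fp24's 16-bit mantissa easily qualifies; plain C float stands in for it here):

/* Every 8-bit integer color value survives a round trip through floating
   point: the values i/255 for i in 0..255 stay distinguishable and round
   back exactly when the mantissa has 8 or more bits. */
#include <assert.h>
#include <math.h>

int main(void)
{
    for (int i = 0; i <= 255; ++i) {
        float f = i / 255.0f;                 /* integer in  -> float */
        int back = (int)lrintf(f * 255.0f);   /* float -> integer out */
        assert(back == i);
    }
    return 0;
}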
 
I'm pretty sure that John has a GL2 path for 3dlabs stuff. It's obviously not final, but it's built on their early specs. This could have changed (and will change with the release of the final specs).

:)
 
demalion said:
Limited register usage, otherwise ignoring vec3/scalar opportunities, and reducing performance when using textures are pure limitations...why should they be part of the spec? It is excelling at those non-IHV related things that allows the R3xx to excel. It is being deficient at those non-IHV related things that causes the NV3x to have problems.
Co-issue of vector and scalar ops doesn't require programmer input. It's a scheduling issue. Type definition requires programmer input for proper results in all situations (either per-variable or per-instruction).

I mention the different types because neither DX9 nor ARB2 offers enough in the way of different data types to accommodate the NV3x architecture.

As for the virtue of supporting lower precisions: why is it better than supporting some variation of operations per component?
I don't see the two things as opposing. Therefore I don't see one as better than the other. That, and I can easily construct simple arguments that explain why FX12 will be enough in a number of scenarios, FP16 enough in others, and others have sufficiently described where FP32 may be beneficial (texture ops). I can't construct a simple argument on how coissue of vector and scalar ops is good. That depends on how frequently scalars are used in pixel shaders, something for which I have no information.

The R3xx isn't "tied" to them, it excels at them. It's readily apparent that compared to it, the NV3x does not. Does that make low level languages "bad", or does it make the NV3x "bad"? You consistently propose the first because you refuse to accept the second, and you also continually fail to validate that refusal by any method other than ignoring inconvenient observations.
Independent of the NV3x, low level language standards are "bad." The NV3x is just an example. Another easy example is the x86 architecture. It is known that the x86 architecture is highly inefficient: today many transistors are wasted in translating x86 into a RISC instruction set. If, somehow, a low-level language had not been standardized, but instead a higher level language, we would not have this problem: any CPU manufacturer could develop any instruction set with essentially no penalty on legacy software. This would allow competition based upon different instruction sets, not just optimization of one single instruction set (as we see in the PC space...).

Question: do the NV3x weaknesses disappear when using a HLSL? It seems to me that all the HLSLs are is greater abstraction that allows more optimization opportunities. The NV3x would still be using those opportunities to try and avoid slowing down, and the R3xx would still be using them to speed up from its baseline.
I'm not sure the R3xx could speed up from its baseline by a large amount. As one example, look at the amount of extra speed ATI was able to coax out of 3DMark03 through optimizations vs. the amount nVidia was able to coax (granted, this is far from a perfect example, but it's the only one we have).
But, in the end, these differences in view:
"avoid slowing down" vs.
"speed up from its baseline"

Are equivalent to:
"the glass is half empty" vs.
"the glass is half full"

All that your statement does is paint the NV3x in a negative light. Compiler optimizations would be used for both chips to "avoid slowing down" and for both chips to "speed up from their baseline."
 
AndrewM said:
I'm pretty sure that John has a GL2 path for 3dlabs stuff. It's obviously not final, but it's built on their early specs. This could have changed (and will change with the release of the final specs).

:)
That's what I had heard. It would certainly be nice if JC could use GL2 as the default/standard in DOOM3, though I suppose we should not be overly optimistic.
 
Humus said:
Indeed. I just wish people would recognize that ARB2 is nothing but a DoomIII term and has nothing to do with the OpenGL API itself.

No worries... Anyone who knows what the ARB is will already be aware that it is not a Doom 3 term.. I do not know exactly what ARB_fragment_program and ARB_vertex_shader do, but I guess that they are just the parts of the API through which you can send the assembler program and the constants to process fragments and vertices, respectively..

Chalnoth said:
Which is the point. The R3xx architecture is so closely-tied to the specs (it appears that ATI got pretty much exactly what they wanted for the standard assembly shader language specification in both DX9 and OpenGL) that there won't be much difference in optimizing for the R3xx in ARB2 and optimizing for any other video card in ARB2.

I totally disagree. Even if the R3xx architecture is closely tied to the specification, I do not think that there is anything in the specification that exploits the parallelism in the R3xx architecture. I guess it depends on how you write the asm code to process the fragments and vertices. You can write code which executes fastest on the R3x0 series but still runs on the other hardware, because it is still an ARB_x_shader. I also do not think that the precision is the whole issue here..

demalion said:
Pixel shader assembly language specs have some basic characteristics:

  • Have instructions
  • Use registers
  • Use textures
  • Use values
      • Commonly vary distinctly in the number of components used for instructions
      • Require widely varying ranges and precisions depending on what is being done

I would add the execution speed of the instructions into this mixture.. I do not think that each instruction executes at the same speed.. Most probably instructions like sqrt and rsq take many more cycles to execute than add and mul. You can always write faster code by avoiding these slowly executed instructions where possible..
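As a concrete (and hypothetical) instance of that advice, here is a common reformulation expressed in C terms: a light-falloff term rewritten so the per-fragment sqrt-class op disappears, at the cost of a somewhat different falloff curve:

#include <math.h>

/* naive: one sqrt-class op per fragment (sqrt/rsq are the slow ones) */
float falloff_linear(float dist2, float inv_radius)
{
    return 1.0f - sqrtf(dist2) * inv_radius;
}

/* reformulated: mul/mad-class ops only; the curve becomes quadratic,
   which is often acceptable for light attenuation */
float falloff_quadratic(float dist2, float inv_radius2)
{
    return 1.0f - dist2 * inv_radius2;
}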

I asked the original question because, when you look at the initial benchmarks, it seems that the GF series totally trashed the R3x0.. But if JC does not bother to write R3x0-architecture-specific ARB_x_shader code, the R3x0 series will never appear faster than the NV3x series. That's why I am curious whether he is ever going to write R3x0-optimized shader code which exploits the efficiency of the architecture and which may also be used with the ARB2 path..
 
I totally disagree. Even if the R3xx architecture is closely tied to the specification, I do not think that there is anything in the specification that exploits the parallelism in the R3xx architecture.
There doesn't need to be. This is the sort of thing that should be handled by the assembly-to-machine compiler. Automatic data type casting is a much stickier subject, and I'm not sure it would be prudent to attempt it (unless absolutely necessary).

Edit: Had an improper reply. Removed.
 
Chalnoth said:
I totally disagree. Even if the R3xx architecture is closely tied to the specification, I do not think that there is anything in the specification that exploits the parallelism in the R3xx architecture.
There doesn't need to be. This is the sort of thing that should be handled by the assembly-to-machine compiler. Automatic data type casting is a much stickier subject, and I'm not sure it would be prudent to attempt it (unless absolutely necessary).

Edit: Had an improper reply. Removed.

Should I be concerned with the edited part? :p

I agree with the type casting part... It's almost impossible for an assembly-to-machine compiler to determine the type on the fly. There is no doubt about that..

About exploiting parallelism.. This might also be hard for the asm-2-machine compiler. The compiler can check the dependence between the instructions and may re-order them on the fly to exploit that parallelism, but the programmer can do that much better than any compiler.. That's what I'm trying to say...

Best,

Silhouette
 
Chalnoth said:
demalion said:
Limited register usage, otherwise ignoring vec3/scalar opportunities, and reducing performance when using textures are pure limitations...why should they be part of the spec? It is excelling at those non-IHV related things that allows the R3xx to excel. It is being deficient at those non-IHV related things that causes the NV3x to have problems.
Co-issue of vector and scalar ops doesn't require programmer input. It's a scheduling issue. Type definition requires programmer input for proper results in all situations (either per-variable or per-instruction).

Why are different range and precision types "required"? What can you do with lower precision processing that you can't do with higher precision processing? You keep making statements based on there being something, and ignoring when I ask you to specify what it is.

Your comment seems fallacious, since declarations in shader programming can declare component usage and range. The distinction you try to make comes off as something made up on the spot to propose there is some unspecified difference to avoid the direct question of why wasting transistors on vec4 dependency is not worthy of criticism. Feel free to correct my impression at some point...repetition and ignoring my commentary isn't doing that, as I've told you several times.

I mention the different types because neither DX9 nor ARB2 offers enough in the way of different data types to accommodate the NV3x architecture.

Well, to be consistent with your own stipulations, aren't you asking for different instruction types for different precision processing in the DX9 LLSL or ARB_fragment_program? Was the easy parallel to per instruction data masking in low level shader code too inconvenient?

Let me inject another attempt at clarity:

"The NV3x has too much dependence on lower precision data processing for performance for either DX 9 or ARB_fragment_program". What is wrong with that statement, besides illustrating that nVidia doesn't have absolute control over the industry?

You still haven't provided the argument as to why lower precisions are worthwhile, you're just repeating yourself. Well, so far it seems to boil down to "the NV3x needs them, and stating otherwise would mean that dependence is a flaw in its design".

As for the virtue of supporting lower precisions: why is it better than supporting some variation of operations per component?
I don't see the two things as opposing.

Maybe if you ignore that the R3xx made one decision and the NV3x made another. Well, I'm understating things...you have to ignore the entire idea of hardware design limitations and tradeoffs, but you seem to be in good practice. :-?

Therefore I don't see one as better than the other.

Because you're doing anything possible to avoid the idea that the NV3x design might in any way represent poor choices. In this case, by proposing that the idea of distinct choices is meaningless through ignoring that the hardware designs under discussion have indeed made these choices.

That, and I can easily construct simple arguments that explain why FX12 will be enough in a number of scenarios, FP16 enough in others,

What I'm asking you is why they're better than fp24. You could also construct simple arguments that explain why FX12 and FP16 are NOT enough, but then you'd have to tackle that in THOSE circumstances, dependency on lower precision for processing is purely a weakness. Your repetitions simply ignore that factor over and over.

and others have sufficiently described where FP32 may be beneficial (texture ops).

Hmm? The recent discussion I recall was about how fp24 is beneficial, and sufficient. Could you point out what you are thinking of?

Seems to be more of "recognizing benefits of fp32 and ignoring drawbacks of fp16 and fx12", in any case.

I can't construct a simple argument on how coissue of vector and scalar ops is good.
Err...when you have a vec3 and a scalar to coissue, and have some reason to want your shader to execute more quickly? :oops:
That depends on how frequently scalars are used in pixel shaders, something for which I have no information.
And you have more information on how often FX12 precision and range are used in pixel shaders? Is that a product of thought and investigation, or selective vision? If it is the product of thought and investigation, please share so we'll have something useful to discuss?
The R3xx isn't "tied" to them, it excels at them. It's readily apparent that compared to it, the NV3x does not. Does that make low level languages "bad", or does it make the NV3x "bad"? You consistently propose the first because you refuse to accept the second, and you also continually fail to validate that refusal by any method other than ignoring inconvenient observations.
Independent of the NV3x, low level language standards are "bad." The NV3x is just an example.
Please note the part in bold.
Another easy example is the x86 architecture. It is known that the x86 architecture is highly inefficient: today many transistors are wasted in translating x86 into a RISC instruction set. If, somehow, a low-level language had not been standardized, but instead a higher level language, we would not have this problem: any CPU manufacturer could develop any instruction set with essentially no penalty on legacy software. This would allow competition based upon different instruction sets, not just optimization of one single instruction set (as we see in the PC space...).

Ah, so that transistor waste is bad. What does that have to do with GPU transistor usage, though?

I've already mentioned your fallacy when you tried to say that ARB_fragment_program and DX 9 LLSL are the R3xx low level language...GPUs don't have hardware to implement shader "assembly" specifications. The low level specifications are just another language that needs to be converted for each hardware. Making an assertion otherwise just seems to avoid recognition that one hardware's microcode might be inferior for what the LLSL is trying to accomplish, and that if a HLSL is trying to accomplish the same thing, the issues will still be there. In the context of your analogy, the NV3x would need to waste more transistors than the R3xx to do the same work.

Low level language has drawbacks and strengths... without faulty comparisons, please show how the drawbacks shown by the NV3x in various LLSLs wouldn't be shown in a shader HLSL? The fact of the matter is textures, registers, and the same operations are still there, because they are the purpose of the shader in the HLSL or the LLSL.

Your calling it assembly and drawing the parallel to x86 hardware instruction implementation and "transistor wasting" is the only basis you present for low level specifications being bad. Again, what it looks like is you are throwing a bad example to put off simply answering the question of whether, just maybe, the NV3x might be "bad" itself. Witness the again repeated practice of ignoring the issues of texture usage, registers, "parallelism", etc. The problem with your vec3/scalar "response" was more of the same.

Question: do the NV3x weaknesses disappear when using a HLSL? It seems to me that all the HLSLs are is greater abstraction that allows more optimization opportunities. The NV3x would still be using those opportunities to try and avoid slowing down, and the R3xx would still be using them to speed up from its baseline.
I'm not sure the R3xx could speed up from its baseline by a large amount. As one example, look at the amount of extra speed ATI was able to coax out of 3DMark03 through optimizations vs. the amount nVidia was able to coax (granted, this is far from a perfect example, but it's the only one we have).

:LOL: I guess you forgot a few things...like clipping planes?! You know, just maybe that might have some relation to the "optimization" speed up nVidia hardware experienced?!

How do the performance results and image quality compare with other benchmarks where we know there aren't clipping planes involved? Why is introducing the adjusted clipping plane shader performance being proposed by you as more valid than the results with the application detection defeated? What image quality do you end up with, and how does the performance between the R3xx at its clock speed compare to the NV3x, and at what clock speed?

By the way, it isn't the "only" example we have. We know that Dawn was specifically coded for the NV3x, compiled with Cg and hand tuned. Would you say that the eyelid problem is responsible for how the R3xx performance compares with NV3x?

But, in the end, these differences in view:
"avoid slowing down" vs.
"speed up from its baseline"

Are equivalent to:
"the glass is half empty" vs.
"the glass is half full"

Well, it's nice of you to say that after avoiding every aspect of my argument, yet again. Thanks.

It's more like having one glass with no leaks versus another glass that leaks, with the first glass being able to hold more water to boot.

If we're just going to throw analogies out there like that, that is.

All that your statement does is paint the NV3x in a negative light.

Actually, I was being descriptive. Would you care to illustrate why my description of the NV3x as having problems is flawed? I hope you do realize that you've consistently refused to do so. Please do so without ignoring shader performance figures and output, and without including clipping plane manipulations like you did above. :rolleyes:

Compiler optimizations would be used for both chips to "avoid slowing down" and for both chips to "speed up from their baseline."

So you're dropping the "12 versus 8" example you keep repeating? You didn't inform me of that yet.

When your discussion isn't focused solely on the best case of the NV3x, the context of that description proposed in reply will be inappropriate, not before. Your inability to acknowledge even one of the problems of the NV3x other than by deeming it irrelevant, "unknown", or the fault of the "assembly" specification is remarkable.

...

Hey, isn't the nVidia fragment shader extension in OpenGL a low level shader language too? Why isn't that one bad?
 
silhouette said:
Should I be concerned with the edited part? :p
Hehe, no :)
I skimmed too quickly and misunderstood what was said, so the reply I posted made absolutely no sense. That's all.

About exploiting parallelism.. This might also be hard for the asm-2-machine compiler. The compiler can check the dependence between the instructions and may re-order them on the fly to exploit that parallelism, but the programmer can do that much better than any compiler.. That's what I'm trying to say...
If the programmer is aware that the hardware likes to coissue vector and scalar ops, then it should be simple just to place them together in the code. There isn't much reason to explicitly state coissue, hence there is little reason to explicitly support it in the shader language (it should be simple enough to check whether two subsequent instructions can be co-issued... and it needn't even be done on the GPU: the drivers could set it when compiling the assembly to machine code, which will significantly increase the amount of optimization that can be accomplished).
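A minimal sketch of that driver-side check, over a made-up instruction record (no real driver IR is implied): a vec3 (.rgb) op and a scalar (.a) op can pair when the scalar op does not consume the vector op's result.

#include <stdbool.h>

typedef struct {
    int dst;          /* destination temp register index         */
    int src[3];       /* source register indices, -1 when unused */
    unsigned wmask;   /* write mask bits: 1=r, 2=g, 4=b, 8=a     */
} Instr;

static bool reads_reg(const Instr *i, int reg)
{
    for (int k = 0; k < 3; ++k)
        if (i->src[k] == reg)
            return true;
    return false;
}

/* True when a (the rgb op) and b (the alpha op) can issue together.
   Conservative: a real compiler would also track reads per component. */
bool can_coissue(const Instr *a, const Instr *b)
{
    bool a_rgb_only   = a->wmask != 0 && (a->wmask & 8u) == 0;
    bool b_alpha_only = b->wmask == 8u;
    return a_rgb_only && b_alpha_only && !reads_reg(b, a->dst);
}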
 
demalion said:
Chalnoth said:
Type definition requires programmer input for proper results in all situations (either per-variable or per-instruction).
Why are different range and precision types "required"? What can you do with lower precision processing that you can't do with higher precision processing? You keep making statements based on there being something, and ignoring when I ask you to specify what it is.
I think you misunderstood me.

I'm saying nothing about requiring lower precision processing.

I'm saying that to properly make use of lower precision processing, you cannot effectively allow the driver to detect when it can lower the precision. You need programmer input. Therefore, you need type definitions in the shader language to properly use different levels of precision.

Well, to be consistent with your own stipulations, aren't you asking for different instruction types for different precision processing in the DX9 LLSL or ARB_fragment_program? Was the easy parallel to per instruction data masking in low level shader code too inconvenient?
I'm really not sure what you're trying to say here. This is what I think would be the easiest:
Force the programmer to declare temporaries to be used in the shader.
At declaration of the temporaries to be used, assign a specific data type to that temporary. Implementations must use a data type that provides at least the accuracy of the data type chosen.

You still haven't provided the argument as to why lower precisions are worthwhile, you're just repeating yourself. Well, so far it seems to boil down to "the NV3x needs them, and stating otherwise would mean that dependence is a flaw in its design".
The output is still 8-bit. If the majority of processing isn't going to require much higher than 8-bit precision, why not accelerate lower precision than 24-bit?

Maybe if you ignore that the R3xx made one decision and the NV3x made another. Well, I'm understating things...you have to ignore the entire idea of hardware design limitations and tradeoffs, but you seem to be in good practice. :-?
I think you are misunderstanding. Support of FP32 + lower types vs. FP24 is, as I see it, orthogonal to the support for coissue of vec and scalar ops. That is, I don't see that as the tradeoff here.

The tradeoff that I see is that nVidia chose to allow developers to select lower precisions for higher possible performance than would be attainable otherwise, at the expense of performance in shaders that require all high-precision calculations.

But the coissue of vector and scalar ops is separate. There's no reason that an architecture can't support FP32/FP16/FX12 as well as coissue. There's no reason an architecture can't support FP24 with no coissue.

What I'm asking you is why they're better than fp24. You could also construct simple arguments that explain why FX12 and FP16 are NOT enough, but then you'd have to tackle that in THOSE circumstances, dependency on lower precision for processing is purely a weakness.
It's not purely a weakness because the NV3x can execute more vector arithmetic operations per clock than the R3xx.

and others have sufficiently described where FP32 may be beneficial (texture ops).
Hmm? The recent discussion I recall was about how fp24 is beneficial, and sufficient. Could you point out what you are thinking of?
I was just speaking in terms of FP32 vs. FP16. FP24 should be enough as well. nVidia has stated that 32-bit FP really is necessary for proper operation, but I am unsure how much relevance this has. I'm sure FP32 will show benefits in some situations over FP24. I'm just unsure as to where exactly (except the obvious case of very long shaders).

Seems to be more of "recognizing benefits of fp32 and ignoring drawbacks of fp16 and fx12", in any case.

I can't construct a simple argument on how coissue of vector and scalar ops is good.
Err...when you have a vec3 and a scalar to coissue, and have some reason to want your shader to execute more quickly? :oops:
What I meant was: how often will scalar ops actually be used in 3D graphics rendering? Do you know of a shader that a game developer might realistically use that would be half scalar ops and half vec3 ops?

And you have more information on how often FX12 precision and range are used in pixel shaders? Is that a product of thought and investigation, or selective vision? If it is the product of thought and investigation, please share so we'll have something useful to discuss?
Again: output is 8-bit integer. Here are two simple algorithms that will work great with FX12:
1. Weighted averaging of four values (used in bilinear filtering)
2. Multiply add: x + a*y (where x and y are color vectors, and a is a value, between 0 and 1, used in alpha blending)

These are common in basic 3D rendering. It just makes sense that many shaders will want to use these or similar instructions. For example, the blending may be useful for a pretty much any shader that makes use of several components: diffuse, specular, gloss, etc..
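For concreteness, here are those two operations sketched in C over a 12-bit fixed-point type. The s1.10 layout with range [-2, 2) is an assumption about what "FX12" denotes, based on how the NV3x 12-bit fixed format is usually described:

#include <stdint.h>

typedef int16_t fx12;            /* s1.10: 12 bits used, range [-2, 2) */
#define FX12_ONE (1 << 10)

static fx12 fx12_sat(int32_t v)  /* saturate into the representable range */
{
    if (v >  2 * FX12_ONE - 1) v =  2 * FX12_ONE - 1;
    if (v < -2 * FX12_ONE)     v = -2 * FX12_ONE;
    return (fx12)v;
}

/* 1. weighted average of four values (bilinear-style filtering);
      the weights are fx12 values expected to sum to FX12_ONE */
fx12 fx12_avg4(const fx12 v[4], const fx12 w[4])
{
    int32_t acc = 0;
    for (int i = 0; i < 4; ++i)
        acc += (int32_t)v[i] * w[i];
    return fx12_sat(acc >> 10);
}

/* 2. multiply-add x + a*y (alpha-blend style), with a in [0, FX12_ONE] */
fx12 fx12_mad(fx12 x, fx12 a, fx12 y)
{
    return fx12_sat(x + (((int32_t)a * y) >> 10));
}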

Ah, so that transistor waste is bad. What does that have to do with GPU transistor usage, though?

I've already mentioned your fallacy when you tried to say that ARB_fragment_program and DX 9 LLSL are the R3xx low level language...GPUs don't have hardware to implement shader "assembly" specifications.
But it's still lower level than an HLSL. It's still more limiting to the hardware than an HLSL would be.

Low level language has drawbacks and strengths... without faulty comparisons, please show how the drawbacks shown by the NV3x in various LLSLs wouldn't be shown in a shader HLSL? The fact of the matter is textures, registers, and the same operations are still there, because they are the purpose of the shader in the HLSL or the LLSL.
The NV3x is quirky in more ways than just data types. There's also the limited register usage, for one. An HLSL could conceivably do a better job at figuring out how to assign register usage than could an assembly to machine compiler.

Your calling it assembly and drawing the parallel to x86 hardware instruction implementation and "transistor wasting" is the only basis you present for low level specifications being bad.
Wasting transistors in x86 processors isn't the only thing that is bad. You should know: many other processors have much higher performance than the x86 architecture. x86 only keeps up through having more money dedicated to the market.
 
Humus said:
Entropy said:
On the other hand, he has given no indication that I'm aware of that he would recode the Doom3 engine using a HLSL.

He wouldn't have to recode the whole engine, just write another rendering backend. It may not necessarily be quick and easy, but it doesn't sound like an extremely huge task either given that he already has plenty of backends done. I figure he could more or less reuse most of his ARB2 code but change the shader upload code, the setting of constants, etc.

I'm sure that you're correct, but he hasn't indicated that he'll do it, and there doesn't seem to be much point in performing such duplicative work now other than for comparative purposes. The indication was rather that he'd put his efforts into forward-looking projects.

The man has a mind of his own though. Who knows?

There's not much difference between coding for an HLSL and coding for an asm language.

That was the impression I had gotten from the comparative examples of shader code I have seen. Going from the higher level code to asm seemed pretty trivial to be honest. Furthermore, programmers make libraries of these little code snippets.
You might speculate over just where your own example code will end up over time. :)

Entropy
 
MuFu said:
I'd be interested to see what comes of that - can't see how it'd be possible (without falling back to FP16).
You expect any precision artifacts in Doom 3 w/FP16?

AFAIR, Carmack stated sometime in the past that FP16 is quite enough for the Doom 3 game.

Then again, you can always switch to FP32 on NV3x in precision-critical areas of code at the cost of some performance. Something tells me, though, that there won't be many such areas...
 
Chalnoth said:
demalion said:
Chalnoth said:
Type definition requires programmer input for proper results in all situations (either per-variable or per-instruction).
Why are different range and precision types "required"? What can you do with lower precision processing that you can't do with higher precision processing? You keep making statements based on there being something, and ignoring when I ask you to specify what it is.
I think you misunderstood me.

I'm saying nothing about requiring lower precision processing.

That's because the need for lower precision processing being a given is implicit in your statements, and you simply ignore every comment I make that asks you why it is important.

I'm saying that to properly make use of lower precision processing, you cannot effectively allow the driver to detect when it can lower the precision. You need programmer input.

I follow this...if you use lower precision processing, you need programmer input. But why are you using lower precision processing? Why should the programmer make that input?

Therefore, you need type definitions in the shader language to properly use different levels of precision.

Well, what I asked you was why you feel free to criticize the lack of lower precision limitations in language because the NV3x depends upon it, yet you see no problem in a lack of ability to portion processing for scalar and vec3 processing in the NV3x?

Wouldn't that ability allow the NV3x to avoid "transistor waste" more effectively? Your entire demand and assumption of reason for lower precision processing is built on a dependence on transistor savings in hardware. You ignore the R3xx compared to the NV3x, and apparently even the latest incarnation of the NV3x as well, and propose the transistor savings as a self evident fact not relevant to any details, such as transistor count of actual hardware or that low precision processing units sit idle when you aren't processing at lower precision.

You continue to blame anything and everything but the hardware that depends on lower precision, as if ignoring the issues demonstrated with supporting lowered precision causes them to disappear. Could you actually respond to a question or assertion without simple denial and restatement of your "articles of faith"?

Well, to be consistent with your own stipulations, aren't you asking for different instruction types for different precision processing in the DX9 LLSL or ARB_fragment_program? Was the easy parallel to per instruction data masking in low level shader code too inconvenient?
I'm really not sure what you're trying to say here. This is what I think would be the easiest:
Force the programmer to declare temporaries to be used in the shader.
At declaration of the temporaries to be used, assign a specific data type to that temporary. Implementations must use a data type that provides at least the accuracy of the data type chosen.

Skipped over quite a few inconvenient questions, didn't you? Why are you avoiding challenges to your stated viewpoint?

What you're describing is how scalars and vec3 types are declared as well.

The difference between dependence on vec3/scalar and varying range/precision isn't the reason you stated, about one being subject to programmer input and the other not, because at the same level of abstraction both can be subject to programmer input: there are vec4, vec3, and scalar datatypes at the higher level of abstraction, and the programmer decides on that declaration.

The difference between them is that one is easier to optimize for on the fly. A scalar/vec3 coissue is made visible by processing a scalar and vec3 datatype, or analyzing component usage...you're not recognizing the first.
This does not prevent them from being compared. When you do compare, it seems to indicate that the choice of one and not the other as a dependency for transistor savings and full performance can be evaluated.

When these apparently artificial disqualifications for comparison are removed, what it seems we're left with is that one type of transistor savings, the one you're focusing on to exclusion, requires range and precision limitation and programmer planning because it can't be done effectively in a simple fashion, and the other, by your own statements as well, does not require that level of specific programmer planning and input, though programmer input can still be utilized to expose it.
You are proposing that when the first cannot be done effectively on the fly, this is not a deficiency in depending on that mechanism for transistor savings. That's where implicit desirability for lower precision processing is assumed without recognition of reality or counterargument.

There is more than one way to optimize hardware for varying datatypes...range and precision is not the only way. Range and precision requiring unique and extra planning is a negative, and does not grant any new abilities except the ability to spend time planning around the limitations of hardware that depends on it. You consistently avoid specifying the advantages it offers, and just keep proposing that some advantage being gained is a given.

When lower precision can't be used, lower precision processing units are precluded. When a scalar op and a vec3-or-less op can't be executed, no new waste is incurred...at worst, the same waste that would occur with vec4 only units will manifest.

You still haven't provided the argument as to why lower precisions are worthwhile, you're just repeating yourself. Well, so far it seems to boil down to "the NV3x needs them, and stating otherwise would mean that dependence is a flaw in its design".
The output is still 8-bit. If the majority of processing isn't going to require much higher than 8-bit precision, why not accelerate lower precision than 24-bit?

Because all replications of those units sit idle when 24-bit is needed, and the more parallelism, the worse that is; because the nearer to final output precision processing is, the less processing you can do without manifesting errors; because if you accelerate 24-bit processing itself, you can still use it to output lower precision when needed, and the relationship does not work in the other direction; because there are other opportunities for transistor savings that don't require more potential for idle units, but work to reduce it.

Maybe if you ignore that the R3xx made one decision and the NV3x made another. Well, I'm understating things...you have to ignore the entire idea of hardware design limitations and tradeoffs, but you seem to be in good practice. :-?

I think you are misunderstanding. Support of FP32 + lower types vs. FP24 is, as I see it, orthogonal to the support for coissue of vec and scalar ops. That is, I don't see that as the tradeoff here.

Read the bold part again, and note that the tradeoff has already been made in reality. What are you looking at when you "see" and "don't see" the things you mention?

The tradeoff that I see is that nVidia chose to allow developers to select lower precisions for higher possible performance than would be attainable otherwise, at the expense of performance in shaders that require all high-precision calculations.

Then the tradeoff you are seeing is completely fictitious...I even put the fiction in bold for you. The peak throughput for the NV3x is 12, the peak throughput for the R3xx is 16. For the R3xx, texture ops don't reduce that peak, and for the NV3x, they do.

But the coissue of vector and scalar ops is separate.

This stipulation appears to be completely fictitious as well...scalar and vec3 specification occurs at the same declaration instance you are proposing for precision declaration.

There's no reason that an architecture can't support FP32/FP16/FX12 as well as coissue.

Except that the architecture in question doesn't happen to support coissue. I asked why you refused to recognize that as a flaw associated with the NV3x's unsuitability, not for you to demonstrate the practice again.

There's no reason an architecture can't support FP24 with no coissue.

Could we stop with the fiction? You continue to maintain that all cross vendor shader specifications are against the NV3x, and allied with the R3xx. Now you're doing it by re-ordering the universe to "fp32/fp16/fx12 and coissue" and "fp24 without coissue" to avoid recognizing that "fp24 with coissue" and "fp32/fp16/fx12 without coissue" is related to why the R3xx succeeds with cross vendor shader specifications and the NV3x stumbles.
What I'm asking you is why they're better than fp24. You could also construct simple arguments that explain why FX12 and FP16 are NOT enough, but then you'd have to tackle that in THOSE circumstances, dependency on lower precision for processing is purely a weakness.
It's not purely a weakness because the NV3x can execute more vector arithmetic operations per clock than the R3xx.

Added some bolding at the end for reading assistance.
Heh, "more vector arithmetic operations per clock". Again, operations on scalars seem have to disappeared. But you "don't know" about scalar ops in shaders, so that's OK.

But where'd the texture ops go?

and others have sufficiently described where FP32 may be beneficial (texture ops).
Hmm? The recent discussion I recall was about how fp24 is beneficial, and sufficient. Could you point out what you are thinking of?
I was just speaking in terms of FP32 vs. FP16. FP24 should be enough as well.
I was confused when you didn't say "FP24".
nVidia has stated that 32-bit FP really is necessary for proper operation, but I am unsure how much relevance this has. I'm sure FP32 will show benefits in some situations over FP24. I'm just unsure as to where exactly (except the obvious case of very long shaders).
How does your "not much more than 8-bit precision" fit into long shaders, by the way? Don't both full speed processing and greater precision matter more at greater shader lengths?
Seems to be more of "recognizing benefits of fp32 and ignoring drawbacks of fp16 and fx12", in any case.
I'm assuming this is an editing mistake, and that I'm not just missing a subtle rewording of my statement. Or were you just saving me the trouble of saying this again?

I can't construct a simple argument on how coissue of vector and scalar ops is good.
Err...when you have a vec3 and a scalar to coissue, and have some reason to want your shader to execute more quickly? :oops:
What I meant was: how often will scalar ops actually be used in 3D graphics rendering?

You continue to propose you have that answer for FX12 usage, but your substantiation seems a bit thin.

Do you know of a shader that a game developer might realistically use that would be half scalar ops and half vec3 ops?

That's only required for the R3xx to lead outside of texture ops and register usage beyond the NV3x limitations. Do you know of a shader benchmark where the NV3x leads even as much as its clock advantage? You do seem to be trying to ignore discussion of Dawn.

What about texture ops and register utilization?

And you have more information on how often FX12 precision and range are used in pixel shaders? Is that a product of thought and investigation, or selective vision? If it is the product of thought and investigation, please share so we'll have something useful to discuss?
Again: output is 8-bit integer. Here are two simple algorithms that will work great with FX12:
1. Weighted averaging of four values (used in bilinear filtering)
2. Multiply add: x + a*y (where x and y are color vectors, and a is a value, between 0 and 1, used in alpha blending)

What happens when you use these values for further calculations? Or are we talking about simple shaders again?

These are common in basic 3D rendering. It just makes sense that many shaders will want to use these or similar instructions. For example, the blending may be useful for a pretty much any shader that makes use of several components: diffuse, specular, gloss, etc..

Yes, but if that's all you wanted to do, you could have used DX 8 hardware. Again, I point out that the R3xx seems to keep up well with the NV3x in executing Dawn shaders. How does that happen with the NV3x having a clock speed and throughput advantage while running code hand-tuned for it, and the R3xx being tied down with fewer transistors in the first place, the silly scalar/vec3 thing, and "wasting" transistors for higher precision processing?

Could your analysis be flawed in some way?

Hmm...well, maybe it's a fluke, as well as all the other shader benchmarks, but could you help with pointing me in the direction of the info that supports that belief? Without involving clipping planes again.

Ah, so that transistor waste is bad. What does that have to do with GPU transistor usage, though?

I've already mentioned your fallacy when you tried to say that ARB_fragment_program and DX 9 LLSL are the R3xx low level language...GPUs don't have hardware to implement shader "assembly" specifications.
But it's still lower level than an HLSL. It's still more limiting to the hardware than an HLSL would be.

I asked you to tell me why the NV3x deficiencies wouldn't manifest in HLSL, not to just state that they wouldn't.

Low level language has drawbacks and strengths... without faulty comparisons, please show how the drawbacks shown by the NV3x in various LLSLs wouldn't be shown in a shader HLSL? The fact of the matter is textures, registers, and the same operations are still there, because they are the purpose of the shader in the HLSL or the LLSL.
The NV3x is quirky in more ways than just data types. There's also the limited register usage, for one. An HLSL could conceivably do a better job at figuring out how to assign register usage than could an assembly to machine compiler.

Ayep, and the R3xx doesn't have that limitation. The LLSL exposing that isn't because the LLSL is bad, it is because the NV3x is bad at the LLSL. The headaches involved in working around that are NV3x-specific. The headaches don't disappear when using the HLSL; it just does more work trying to avoid those headaches for the programmer.
These headaches are the only tangible manifestation of the "benefits" of lower precision processing that has been demonstrated.

Your calling it assembly and drawing the parallel to x86 hardware instruction implementation and "transistor wasting" is the only basis you present for low level specifications being bad.
Wasting transistors in x86 processors isn't the only thing that is bad. You should know: many other processors have much higher performance than the x86 architecture. x86 only keeps up through having more money dedicated to the market.

And what does that have to do with the flaws in your analogy?
 
Coissuing scalar and vector ops is quite handy and part of the PS 1.1-1.4 spec, and both NV2x and R2x0 benefit from it. NV2x has separate alpha and rgb combiners (that can be combined too), and similarly the R2x0.

A real life example is the bumpmap shader from Tenebrae:

ps_1_4
texld r0, t0
texld r1, t1
texld r2, t2
texld r3, t3
texld r4, t4
dp3_sat r2.rgb, r0_bx2, r2_bx2
+mov_x8_sat r2.a, r1_bx2.b
dp3_sat r1.rgb, r0_bx2, r1_bx2
+mad_x2_sat r1.a, r2.b, r2.b, c0.b
mul r1.rgb, r1, r3
+mul r1.a, r1.a, r1.a
mul r4.rgb, r4, v0
+mul r1.a, r1.a, r1.a
mul r1.rgb, r1, r2.a
+mul r0.a, r1.a, r0.a
add r0.rgb, r1, r0.a
mul_sat r0, r0, r4

(Yes, I know, we are actually using the similar OpenGL extension from ATI, but I wrote and verified the code using DX9 ps-assembler and then converted it to function calls, which is a pain in the ass btw.)

As you can see, it's actually 7 arithmetic instruction pairs, which is under the limit of 8 per phase, and most of the pairs contain both a vector and a scalar instruction. I don't actually know the speed difference on R2x0 between 1-phase and 2-phase shaders, but it shouldn't hurt anyway.
 
jpaana said:
A real life example is the bumpmap shader from Tenebrae:

ps_1_4
texld r0, t0
texld r1, t1
texld r2, t2
texld r3, t3
texld r4, t4
dp3_sat r2.rgb, r0_bx2, r2_bx2
+mov_x8_sat r2.a, r1_bx2.b
dp3_sat r1.rgb, r0_bx2, r1_bx2
+mad_x2_sat r1.a, r2.b, r2.b, c0.b
mul r1.rgb, r1, r3
+mul r1.a, r1.a, r1.a
mul r4.rgb, r4, v0
+mul r1.a, r1.a, r1.a
mul r1.rgb, r1, r2.a
+mul r0.a, r1.a, r0.a
add r0.rgb, r1, r0.a
mul_sat r0, r0, r4
Since this appears to be a pathological case for coissue, it would be very nice to see this shader's performance compared (in PS 1.x and PS 2.0) on the NV3x vs. R3xx.
 
Ilfirin said:
I expect precision artifacts in the specular highlights from the low-precision normal maps.
First, I'd have to actually see these artifacts from "low-precision" FP16 normal maps.
Second, as I've said, you can always opt for FP32 for these normal maps, if FP16 really is that awful.
 