Why doesn't DX9 support simple Quads?

Chalnoth said:
There are two possible reasons for what happened.

1. The use of integer formats was decided somewhere around two years before the release of DirectX 9. If this is the case, then nVidia cannot be at fault, because Microsoft didn't develop the DirectX 9 spec until at least 12-18 months into development. It would have cost too much to change by then, and, besides, nVidia had other problems at the time.
Sorry, that is still nVidia's fault - for not being forward-looking enough. You employ some pretty twisted logic to arrive at your bizarre conclusions.
Either way, you, along with a great many other people, seem to be operating under the assumption that just because the NV30 was finally released after DirectX 9, nVidia actually had the ability to change the NV30's architecture to match what Microsoft decided was to be the DirectX 9 spec. That's simply ludicrous. It takes too much time and too much money to change a significant part of the spec like that.
NO!
We are not operating under that assumption. We are simply saying: NVIDIA'S FAILINGS ARE THEIR FAULT.
Whereas you are operating under the strange, logic-free assumption that somehow everyone else is supposed to bow to nVidia's crappy design decisions.
API specs have always been designed after hardware. It's not nVidia who decided to violate the DirectX 9 spec. It's Microsoft who decided to write the DirectX 9 spec to not work well with the NV30-34.
OMG!
Get real :rolleyes:
 
Chalnoth said:
FUDie said:
Which is not Microsoft's problem. NVIDIA knew what the DX9 requirements were. NVIDIA chose not to implement them as efficiently as the competition. NVIDIA is to blame here, not Microsoft. I don't care about "woulda, shoulda, coulda", the point is that the DX9 spec. is FP24 minimum, with FP16 allowed on operations specified with _pp, that is it.
There are two possible reasons for what happened.

1. The use of integer formats was decided somewhere around two years before the release of DirectX 9. If this is the case, then nVidia cannot be at fault, because Microsoft didn't develop the DirectX 9 spec until at least 12-18 months into development. It would have cost too much to change by then, and, besides, nVidia had other problems at the time.
There's one big problem with this whole idea: R300. If ATI could get it right, why not NVIDIA?
2. The original NV30 spec wasn't supposed to have inherent support for INT12, and was supposed to be quite fast at FP16 and FP32. The current NV30 was born of a number of problems and mistakes that occurred. If this is the case, then it is nVidia's fault, but only in that their original design was too aggressive for the resources allocated (i.e. it was a management and/or engineering problem).
Wow, you at least concede there's one possibility that NVIDIA may be to blame, amazing.
Either way, you, along with a great many other people, seem to be operating under the assumption that just because the NV30 was finally released after DirectX 9, nVidia actually had the ability to change the NV30's architecture to match what Microsoft decided was to be the DirectX 9 spec. That's simply ludicrous. It takes too much time and too much money to change a significant part of the spec like that.
There's one big problem with all of this: R300. Oh yeah, I said that earlier.
API specs have always been designed after hardware. It's not nVidia who decided to violate the DirectX 9 spec. It's Microsoft who decided to write the DirectX 9 spec to not work well with the NV30-34.
ROTFLMAO. You must live in a very interesting world, Chalnoth. You don't need INTs. FP16 can do everything FX12 can do (and then some) and FP16 is supported by DX9 with the _pp modifier. You just need floats to go fast. ATI did this, NVIDIA dropped the ball.

-FUDie
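FUDie's FP16-covers-FX12 point can be sanity-checked numerically. A throwaway sketch, assuming FX12 is an s1.10 fixed-point format spanning [-2, 2) in steps of 1/1024 (a common description of NV30's integer precision); it verifies that every such value fits in FP16's 11 significand bits, so nothing is lost when an FX12 value is promoted to half precision.

```cpp
#include <cmath>
#include <cstdio>

// True if v is exactly representable in IEEE-754 half precision
// (11 significand bits). Exponent range is ignored here because it
// cannot be exceeded by values in [-2, 2) at 1/1024 granularity.
static bool exact_in_half(double v) {
    if (v == 0.0) return true;
    int e;
    double m = std::frexp(std::fabs(v), &e); // |v| = m * 2^e, m in [0.5, 1)
    double scaled = m * 2048.0;              // 11-bit window: [1024, 2048)
    return scaled == std::floor(scaled);     // needs <= 11 significant bits
}

int main() {
    int misses = 0;
    // Assumed FX12 encoding: k / 1024 for k in [-2048, 2047].
    for (int k = -2048; k <= 2047; ++k)
        if (!exact_in_half(k / 1024.0)) ++misses;
    std::printf("FX12 values not exact in FP16: %d\n", misses); // prints 0
    return 0;
}
```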
 
andypski said:
- Why do you (and others) regularly insist on picking CPU architectures as providing the example of what VPUs should or should not do?
That's not what I'm trying to do. What I'm trying to do is state that implementing different types is rather natural to programmers today, and therefore does not constitute a major challenge.

- What similarities in design do you see between VPUs and CPUs that leads you to believe that this is an appropriate or valid argument?
In particular, I don't think the hardware should replicate CPUs greatly. CPUs have a long legacy of being very inefficient processors, and GPUs have different aims for performance.

What I am stating is that support for different data types may be of great benefit for reasons of performance. One particular difference that I can imagine is that it may make sense for a CPU to only support the highest-precision data type in hardware and emulate all lower-precision data types. But with a GPU, if you're only going to be using a specific precision very rarely, it makes more sense, given the already massive parallelism, to have multiple units operating at different precisions. The optimal amount of each type of unit should be determined by software.

- Why do you not pick DSPs as the architectural precedent, or dedicated high-speed SIMD/vector processors such as Cray? How do you see VPUs as being more similar to CPUs than to these architectures?
I don't. I just don't know all that much about these architectures. I was attempting to draw more parallels to how people program for CPUs than to the actual underlying architecture.

And just because a shader is simple doesn't necessarily mean that it automatically requires low accuracy.
Which is the same as saying that just because a calculation is "complex" doesn't mean it automatically requires high accuracy. It all depends on the source art used, the calculation(s) performed, and the target for the data (which won't always be the current 8-bit framebuffer).

Anyway, as for the performance of the NV3x, it seems that there are many other problems that are preventing it from achieving high performance that may well be very unrelated to the support of multiple precisions. One cannot pick out a single feature of the two cores and blame their relative performance on that one feature.

After all, I could just as easily say that the Radeon 9700's comparatively high shader performance in most current shader benchmarks is a direct consequence of its support of sparse-sampling MSAA.
 
Chalnoth said:
That's not what I'm trying to do. What I'm trying to do is state that implementing different types is rather natural to programmers today, and therefore does not constitute a major challenge.

Ok - I can see where you're coming from.

In particular, I don't think the hardware should replicate CPUs greatly. CPUs have a long legacy of being very inefficient processors, and GPUs have different aims for performance.

Yup, which is one of the reasons why equating them to CPUs and introducing CPU-like constructs such as multiple precisions can be problematic.

What I am stating is that support for different data types may be of great benefit for reasons of performance. One particular difference that I can imagine is that it may make sense for a CPU to only support the highest-precision data type in hardware and emulate all lower-precision data types. But with a GPU, if you're only going to be using a specific precision very rarely, it makes more sense, given the already massive parallelism, to have multiple units operating at different precisions. The optimal amount of each type of unit should be determined by software.

This may be the case but, depending on how you handle it, introducing additional precisions may potentially also create significant complexity in your VPU design - possible problems such as instruction scheduling, the number of register read ports (to feed all the potential variations of instructions that can execute in one cycle), pack/unpack logic in the register and writeback paths, etc.

How these and other elements of the design are addressed will decide how flexible your final design is, and whether you can actually get good utilisation out of the available resources. Do you end up achieving significantly more than one instruction per cycle or not? Can you only feed the additional fixed point units with the results from the FP units in a single clock, or do you need to fetch additional registers as inputs? Do they have separate write-back paths?

- Why do you not pick DSPs as the architectural precedent, or dedicated high-speed SIMD/vector processors such as Cray? How do you see VPUs as being more similar to CPUs than to these architectures?
I don't. I just don't know all that much about these architectures. I was attempting to draw more parallels to how people program for CPUs than to the actual underlying architecture.

And that's the underlying problem - drawing parallels to how people program for CPUs while understanding that CPUs (which are very different beasts, with complex scheduling to exploit instruction-level parallelism rather than data parallelism) are typically highly inefficient, and not suited to the tasks that VPUs (and vector or SIMD processors) are designed to excel at.

VPUs are designed to deal with massively parallel streams of data, attempting to closely balance the rate of execution of instructions to the IO characteristics to achieve good utilisation of their expensive execution resources, and therefore of the silicon budget. Generalised resources that can be used without fail in every instruction are good, because they aren't ever wasted. Specialised resources that can only be used every nth instruction (for whatever reason - scheduling, insufficient precision for the task, etc.) introduce inefficiencies, and silicon area that may well be frequently wasted.

After all, I could just as easily say that the Radeon 9700's comparatively high shader performance in most current shader benchmarks is a direct consequence of its support of sparse-sampling MSAA.

Not quite sure how the support for sparse-sampling MSAA improves shader performance, or did I miss something?
 
andypski said:
This may be the case but, depending on how you handle it, introducing additional precisions may potentially also create significant complexity in your VPU design - possible problems such as instruction scheduling, the number of register read ports (to feed all the potential variations of instructions that can execute in one cycle), pack/unpack logic in the register and writeback paths, etc.
Right. But these are engineering concerns. In the end, it all comes down to how well the problems are managed. I still say that the bottom line is performance, and even if they do make the engineering a bit more challenging, multiple data types offer the potential for higher performance, given that the lower-precision data types are useful for programmers (if the lower precisions are never used, of course, then they can't improve performance).

Personally, I think that this should be the breakdown of the various precisions for a future GPU:
16-bit integer (should also be a standard framebuffer format, would allow for high fidelity output, with a few bits to spare, and would be great for any sort of operation that doesn't require high dynamic range, such as blending)
32-bit floating-point (excellent for the vast majority of calculations)
64-bit floating-point (Some calculations will need this precision to work properly)

If such an architecture comes to light, it may not be prudent for performance to have explicit 16-bit integer support. However, I think it would be very smart to have both FP32 and FP64 units. Particularly at these precisions, it seems like the increased transistor count to go all-FP64 would just reduce the total number of functional units by too much.

And lastly, I think that FP64 is coming, largely because movie-quality shaders will need them occasionally. nVidia especially has been expressing interest in moving forward with the convergence of realtime and offline rendering. I don't know what FP64 would be needed for in games just yet, but I'm sure people will find a reason...
 
Chalnoth said:
Right. But these are engineering concerns. In the end, it all comes down to how well the problems are managed. I still say that the bottom line is performance, and even if they do make the engineering a bit more challenging, multiple data types offer the potential for higher performance, given that the lower-precision data types are useful for programmers (if the lower precisions are never used, of course, then they can't improve performance).

Not only engineering concerns - it also means engineering OVERHEAD. All the converters, schedulers, interpreters and so on are rather complex, and that for each and every instruction in your so-fast pixel shader...

Perfect is not when you cannot add anything more; it's when you cannot remove anything anymore.

If you only support one data type, you can remove all sorts of type/size identification, and thus gain performance. You can also remove all the different implementations of the very same instructions, reducing chip size, transistor count, etc., which you can instead spend on, again, more performance.

Instead of having three types internally, which results in roughly one FPU and one IPU per shader, they could have split them and replaced them with two FPUs (the integer unit has twice the speed => it could at least stand in for a normal FPU).
The result would be a single-typed pipeline at twice the speed of the current FP32 path. And, if you look, that's about what a Radeon provides.

Tell me one reason why integers are useful in a non-pointer world.

Performance is no reason. In deeply pipelined, parallelized vector/stream-processing units, generalizing (support for several types, etc.) is NEVER a speed gain.

ATI shows how a GPU can be made: simple => fast. nVidia shows how a GPU can be made wrong: complex, overloaded => slow.
 
Humus said:
JohnH said:
Oh hang on, you're using it for sprites, so you'd have to submit them in separate prim calls. Though I still doubt that the extra data for the indices is going to make that much difference to your performance, unless of course there's something dodgy with the HW you're running on.

Of course this would be fixed by a programmable prim processor.

John.

Well, 8.3% extra bandwidth may not be that big a deal, but it bothers me that I have to do this at all. The same hardware can do it without this extra work or performance penalty in OpenGL.

I suspect you'll find that in OGL the driver is just turning your quads into triangles. You might find that using D3D and indexed triangles, where you supply the indices in a static index buffer, is faster, as the driver would no longer have to mess with the data.
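A minimal sketch of that static index buffer approach, assuming a D3D9-style setup with 16-bit indices and quads laid out as four consecutive vertices each; buffer creation and locking are omitted, and the vertex size in the comment is only an assumed example (Humus doesn't state his format).

```cpp
#include <cstdint>
#include <vector>

// Build a static index buffer that draws quadCount quads (4 consecutive
// vertices each, 0-1-2-3 per quad) as an indexed triangle list: two
// triangles, six indices per quad. With 16-bit indices that is 12 bytes
// of index data per quad; against, say, a 36-byte vertex (4 * 36 = 144
// bytes of vertex data per quad) the indices add 12/144 = 8.3%.
// With 16-bit indices, quadCount must be <= 16384.
std::vector<uint16_t> BuildQuadIndices(uint16_t quadCount) {
    std::vector<uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (uint16_t q = 0; q < quadCount; ++q) {
        uint16_t base = q * 4;
        // Triangle 1: 0-1-2, triangle 2: 0-2-3 (a fan around vertex 0).
        indices.push_back(base + 0);
        indices.push_back(base + 1);
        indices.push_back(base + 2);
        indices.push_back(base + 0);
        indices.push_back(base + 2);
        indices.push_back(base + 3);
    }
    return indices;
}
// The buffer is filled once and reused every frame (e.g. via
// DrawIndexedPrimitive with D3DPT_TRIANGLELIST), so the driver never
// has to touch the data again.
```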

Yes, a programmable primitive processor would make things easier and more flexible, though, and I would be able to implement my stuff the way I want it instead of using the backward fixed-function point sprite model. Then again, I would have the problem of backward compatibility.

Would this not also be true of quads?

John.
 
davepermen said:
ATI shows how a GPU can be made: simple => fast. nVidia shows how a GPU can be made wrong: complex, overloaded => slow.
And I don't think it's that simple. nVidia may have been too aggressive with the NV3x, but that doesn't necessarily mean that the design decisions made were inherently bad ones to make. It may have just been that nVidia didn't foresee the problems that would occur and wasn't able to execute.

As for integer processing, I already conceded that it may not be smart to go for dedicated integer units. However, I see no reason not to go for lower-precision FP units, as long as, say, two FP16 units take about the same number of transistors as one FP32 unit.

Oh, and I think clarifying this quote will be relevant:
andypski said:
Chalnoth said:
After all, I could just as easily say that the Radeon 9700's comparatively high shader performance in most current shader benchmarks is a direct consequence of its support of sparse-sampling MSAA.
Not quite sure how the support for sparse-sampling MSAA improves shader performance, or did I miss something?
That was the point. Without a deeper knowledge of the actual engineering concerns related to how the NV3x shading architecture was designed, it's impossible to know how the various features affected performance.
 
but that doesn't necessarily mean that the design decisions made were inherently bad ones to make. It may have just been that nVidia didn't foresee the problems that would occur and wasn't able to execute.

Well, here I'd say that foreseeing any problems which might occur is part of the design process. Therefore not foreseeing some of the problems they encountered made their design decisions inherently flawed.

Perhaps much of the problem with NV30 et al was that they had difficulties organising and designing the multiple-precision nature of their core? ATI, on the other hand, had their single-precision core which had 'less' complexity and therefore brought up fewer design problems and ultimately higher performance and earlier availability.

We know that not every feature of the NV3X class chips is supported yet, despite many months' work on the drivers, whereas the R3XX chips have had very solid and stable drivers from the off. Could this be another indication that the mooted complexity of the NV3X has done nothing but cause more problems than a simpler design would have?
 
Mariner said:
Well, here I'd say that foreseeing any problems which might occur is part of the design process. Therefore not foreseeing some of the problems they encountered made their design decisions inherently flawed.
I don't think so. I don't think that it is possible to foresee all problems that will occur during development. One big factor was that, from what Uttar's been posting, nVidia's original transistor budget was a fair bit higher than the transistor count of the final NV30. That alone could have resulted in a very large obstacle to getting the design implemented properly.

Anyway, I am stating this from the perspective of a physics student and one who programs as a hobby. Whenever you are exploring new territory, you just don't know what problems you'll encounter. Experience can help, but it's not a sure thing.
 
JohnH said:
I suspect you'll find that in OGL the driver is just turning your quads into triangles. You might find that using D3D and indexed triangles, where you supply the indices in a static index buffer, is faster, as the driver would no longer have to mess with the data.
I rather suspect the hardware is responsible for the correct ordering, just as it is with triangle strips and fans. If you have that in hardware, quads are trivial to implement: just make the hardware automatically start a new triangle fan every four vertices. Quad strips are even simpler, as they are almost identical to triangle strips (with the implication that you have an even number of vertices, and wireframe is different).
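A small CPU-side sketch of the quad-strip observation above (the hypothetical hardware path would do the same thing on-chip): with the usual GL_QUAD_STRIP vertex ordering, the unmodified vertex stream can simply be interpreted as a triangle strip. The names are illustrative, not from any real driver, and winding conventions are glossed over.

```cpp
// With GL_QUAD_STRIP ordering, quad i is built from vertices
// {2i, 2i+1, 2i+3, 2i+2}. A triangle strip over the same stream yields
// triangles {2i, 2i+1, 2i+2} and {2i+2, 2i+1, 2i+3}, which tile exactly
// the same quad, so no index rewriting is needed -- only the
// even-vertex-count rule (and wireframe rendering) differ.
struct StripDraw {
    int firstVertex;    // start of the unmodified vertex stream
    int triangleCount;  // primitives to issue as a triangle strip
};

// vertexCount must be even and >= 4 for a valid quad strip.
StripDraw QuadStripAsTriangleStrip(int vertexCount) {
    int quadCount = (vertexCount - 2) / 2; // quads in the strip
    return { 0, quadCount * 2 };           // two triangles per quad
}
```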

Yes, a programmable primitive processor would make things easier and more flexible, though, and I would be able to implement my stuff the way I want it instead of using the backward fixed-function point sprite model. Then again, I would have the problem of backward compatibility.

Would this not also be true of quads?

John.
No, quads can be done fast on any modern hardware; a PPP would have to be emulated on the CPU.
 
Why not just use plain triangles instead of quads for particles? Sorry, I've never done any 3D programming (no time), but from what little I know, why wouldn't that work?
 
Anyway, I am stating this from the perspective of a physics student and one who programs as a hobby. Whenever you are exploring new territory, you just don't know what problems you'll encounter. Experience can help, but it's not a sure thing.

Then you undertake an appropriate development risk plan (covering performance, cost and time separately for likelihood and impact) with a realistic amount of optimism bias. You don't need to know what all the problems are, just a reasonable number, then extrapolate. Even if it's done on a project of a kind never carried out before (and I assume nVidia did some preliminary R&D/studies for the NV3X) you can still get a reasonable picture. Once that's done you can start making trade-offs.

Never start a project without analysing risk first.

Usual rule is about 15% of total project cost.
 
Right, which makes sense. But that doesn't mean that you're going to succeed in foreseeing all of the pertinent problems. In particular, there's always the possibility of a problem biting back... hard. Otherwise there'd be no missed release schedules, no cancelled products, etc.
 
Chalnoth said:
I don't think so. I don't think that it is possible to foresee all problems that will occur during development. One big factor was that, from what Uttar's been posting, nVidia's original transistor budget was a fair bit higher than the transistor count of the final NV30. That alone could have resulted in a very large obstacle to getting the design implemented properly.

Anyway, I am stating this from the perspective of a physics student and one who programs as a hobby. Whenever you are exploring new territory, you just don't know what problems you'll encounter. Experience can help, but it's not a sure thing.
OK, so you encounter problems, and your arch-rival does not.
Why should the entire industry bow to nVidia's screw-up?
 
Chalnoth said:
Mariner said:
Well, here I'd say that forseeing any problems which might occur is part of the design process. Therefore not forseeing some of the problems they encountered made their design decisions inherently flawed.
I don't think so. I don't think that it is possible to foresee all problems that will occur during development. One big factor was that, from what Uttar's been posting, nVidia's original transistor budget was a fair bit higher than the transistor count of the final NV30. That alone could have resulted in a very large obstacle to getting the design implemented properly.
Since when is Uttar a spokesperson for NVIDIA? If you're basing your whole line of thinking on some quite unsubstantiated rumors, then you're very gullible.

-FUDie
 
Althornin said:
OK, so you encounter problems, and your arch-rival does not.
Why should the entire industry bow to nVidia's screw-up?
Supporting integer data types doesn't sound, to me, like the entire industry bowing to nVidia's screw-up.

And it shouldn't be up to game developers to make the optimizations. I would hope that it's mostly nVidia's developer relations program that's going to need to take up the slack.
 
Chalnoth said:
Supporting integer data types doesn't sound, to me, like the entire industry bowing to nVidia's screw-up.
Your argument sure does, though.
Your argument is:
1) nVidia failed to anticipate the DX spec, so their shader part sucks speed-wise.
2) Ergo, the specifications are at fault.


Do you see that that makes no sense? Another company has demonstrated that nVidia's approach is NOT THE BEST. Even in PS 1.1 ops, nVidia falls behind! (All integer!)

And it shouldn't be up to game developers to make the optimizations. I would hope that it's mostly nVidia's developer relations program that's going to need to take up the slack.
Why can you not face the idea that your hero failed?
 
Right, which makes sense. But that doesn't mean that you're going to succeed in foreseeing all of the pertinent problems. In particular, there's always the possibility of a problem biting back... hard. Otherwise there'd be no missed release schedules, no cancelled products, etc.

Somebody should tell nVidia then, because everything I've read says they adopted a high-risk approach.

Overly complex design, unrealistic design goals, development of a design which could only work effectively on an unproven manufacturing process, lack of control over system requirements. Heck, from what Uttar has said about the transistor budget, they didn't even have basic risk contingency mechanisms in place.

You don't need to have foreseen all the problems, but for major ones which have an impact on your critical path... well, you apply a variety of techniques to basically extrapolate from the ones you've got. Attempting to claim that the problems which occurred with the NV3X project weren't foreseeable is just plain ridiculous (how many chips have they designed...).

Nvidia have only themselves to blame for the NV30.
 
Xmas said:
JohnH said:
I suspect you'll find that in OGL the driver is just turning your quads into triangles. You might find that using D3D and indexed triangles, where you supply the indices in a static index buffer, is faster, as the driver would no longer have to mess with the data.
I rather suspect the hardware is responsible for the correct ordering, just as it is with triangle strips and fans. If you have that in hardware, quads are trivial to implement: just make the hardware automatically start a new triangle fan every four vertices. Quad strips are even simpler, as they are almost identical to triangle strips (with the implication that you have an even number of vertices, and wireframe is different).
Why would you add HW for something that is hardly ever used? They may be simple to implement, but you end up expanding the HW test matrix for no real reason. Anyway, as I said, I would be surprised if much HW actually directly supported them (3DLabs maybe?), as the gain is so small.
Yes, a programmable primitive processor would make things easier and more flexible, though, and I would be able to implement my stuff the way I want it instead of using the backward fixed-function point sprite model. Then again, I would have the problem of backward compatibility.

Would this not also be true of quads?

John.
No, quads can be done fast on any modern hardware; a PPP would have to be emulated on the CPU.

What modern HW do you _know_ directly supports quads? (The presence of them in OGL implies nothing.) Yes, current GPUs would have to do PPP on the host, but that's where the work of turning your quads into triangles is probably being done anyway. I would also hope that if a PPP were present it would be used for things like this, as it allows a further reduction in BW (1 vert vs 4 for this example).
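To put a rough number on the "1 vert vs 4" point: with something like a PPP (or the point-sprite path) the application submits one centre vertex per particle and the corner expansion happens on-chip; without it, the application streams all four corners itself. A sketch of that CPU-side expansion, with entirely assumed minimal vertex formats:

```cpp
#include <vector>

// Assumed minimal formats, for illustration only.
struct Particle { float x, y, z, size; };  // 16 bytes submitted with a PPP / point sprites
struct Corner   { float x, y, z, u, v; };  // 20 bytes per corner, four corners per particle otherwise

// What the application has to do today for camera-facing quads: expand
// each particle into four corners (80 bytes) instead of sending one
// 16-byte centre -- roughly 5x the vertex traffic with these formats.
// The right/up vectors would come from the view matrix; treated as inputs.
std::vector<Corner> ExpandParticles(const std::vector<Particle>& ps,
                                    float rx, float ry, float rz,
                                    float ux, float uy, float uz) {
    static const float su[4] = { -1.f,  1.f,  1.f, -1.f };
    static const float sv[4] = { -1.f, -1.f,  1.f,  1.f };
    std::vector<Corner> out;
    out.reserve(ps.size() * 4);
    for (const Particle& p : ps)
        for (int c = 0; c < 4; ++c) {
            float hs = 0.5f * p.size;
            out.push_back({ p.x + hs * (su[c] * rx + sv[c] * ux),
                            p.y + hs * (su[c] * ry + sv[c] * uy),
                            p.z + hs * (su[c] * rz + sv[c] * uz),
                            0.5f * (su[c] + 1.f), 0.5f * (sv[c] + 1.f) });
        }
    return out;
}
```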

I did have a question buried in one of my previous posts that no one has answered, so I'll unbury it: what do you need quads for, given that the extra BW requirements for indices are unlikely to have much, if any, impact on performance?

Later,
John.
 