Radar1200gs: Stop trying to look stupid, cause your primary argument is flawed. You say the 5200 is the most popular DX9 card, and that FP16 is a gimmick on it.
The only usable format on the 5200 is FX12. FP16 is nearly exactly three times as slow as FX12 on the 5200, and FX12 already isn't too fast. DX9 PS2.0 doesn't expose this functionality, but I believe NVIDIA forces FX12 on the 5200 anyway.
However, I do believe that developers should put "all program in FP16" hints for not-too-complex DX9 shaders, because not only the NV3x and S3 benefit from it, but also the NV4x(!)
Regarding the 5200/5600/5800, I say we should let NVIDIA use FX12 everywhere on them. It would be a disservice to the poor users of these cards not to let them do that, as it's the only way for them to have playable framerates, although with obviously lower IQ.
---
Demirug: That makes sense to me, although I find it extremely stupid from an engineering point of view. But then again, so are the whole register usage penalties, so it does make sense, if you see what I mean.
Regarding the NV40, what I meant is that we would move more towards an ILDP (Instruction Level Distributed Processor), and that the gatekeeper would drastically evolve. Problem is, perhaps I'm just being too ambitious and thinking too much about what NVIDIA will have to do in the NV50 if they want to be efficient, because they certainly don't HAVE to do that in the NV40...
Very basically speaking:
- The gatekeeper can now dispatch and receive several quads at the same time (fixed number though, of course).
- The gatekeeper can send the quads to a few different units (the number of units is the same as the number of quads it can send/get!).
- All of these units have a loopback mechanism to one of the input paths of the gatekeeper.
In the most basic implementation, there's just one path for arithmetic and one for texturing. In the most complex implementation, A.K.A. a true ILDP, each unit has such a path, resulting in an optimal usage of all units at all times.
This is risk-free IMO. For example, let us say the arithmetic path takes 100 slots and texturing one 250 slots. Even if the gatekeeper can send multiple quads at a time, it can never send more than one to a specific path, or get more than one from the same path, in a single cycle.
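To make the one-quad-per-path rule concrete, here is a rough sketch in Python. It is only a toy: the `dispatch` function, the quad IDs and the path names are all made up for illustration and say nothing about how the real hardware is built.

```python
def dispatch(ready, paths=("arith", "tex")):
    """Pick at most one ready quad per path for a single cycle.

    `ready` is a list of (quad_id, path) pairs in arrival order.
    Hypothetical toy model: the real gatekeeper logic is not public.
    """
    sent = {}      # path -> quad_id dispatched this cycle
    waiting = []   # quads that must wait for a later cycle
    for quad_id, path in ready:
        if path in paths and path not in sent:
            sent[path] = quad_id              # this path is still free this cycle
        else:
            waiting.append((quad_id, path))   # path already used, try next cycle
    return sent, waiting


# Three quads are ready; two of them want the arithmetic path.
ready = [(0, "arith"), (1, "arith"), (2, "tex")]
sent, waiting = dispatch(ready)
print(sent)     # one quad per path
print(waiting)  # the second arithmetic quad is held back
```

So several quads can leave in the same cycle, but never two down the same path.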
The idea here is not to reduce the maximum number of slots used. It's to reduce the *average*.
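A quick back-of-the-envelope version of that, again as a Python sketch. The 100 and 250 slot figures are the example numbers from above; the instruction mix is invented, and the "single loop" case assumes every pass has to walk through both sections, which is a simplification on my part.

```python
ARITH_SLOTS = 100  # example figure for the arithmetic path
TEX_SLOTS = 250    # example figure for the texturing path


def single_loop_slots(ops):
    # Assumption: one combined loop, so every op pays for both sections.
    return len(ops) * (ARITH_SLOTS + TEX_SLOTS)


def split_path_slots(ops):
    # Each op only pays for the path it actually uses.
    return sum(ARITH_SLOTS if op == "arith" else TEX_SLOTS for op in ops)


# Hypothetical shader fragment: one texture fetch, then three arithmetic ops.
ops = ["tex", "arith", "arith", "arith"]

print(single_loop_slots(ops) / len(ops))  # average slots per op, single loop
print(split_path_slots(ops) / len(ops))   # average slots per op, split paths
```

The worst case for a single op on the split design is still the 250-slot texturing path, so the maximum doesn't really move; what drops is the average, and the more arithmetic-heavy the shader, the bigger the drop.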
Uttar