Uh, excuse me for laughing out loud, but CHEAP way of doing things?!? Have you checked recently the price of the P4 3GHz? If it's one thing that chip is NOT, it's cheap, man!
Well, you are mistaking manufacturing cost for sale price. I am referring to the manufacturing cost ( the sale price of the 3 GHz Pentium 4 will fall once Prescott is released, and the same will happen to Prescott when the new AMD chip ships to market with good benchmarks and a lower price... ).
Intel sells its chips at considerably more than they cost to produce ( and it has huge volumes too ); in the industry it is appreciated for its high ASPs ( Average Selling Price, IIRC ), appreciated from an economic point of view...
It doesn't mean they could lower the final price of the chip much...
Plus you're forgetting something: those Pentium 4s are going to find their way sooner or later into pre-built PCs in stores and in the DELL and HPaq online shops... before NV30s and R300s do... I have seen in the past a lot of computers shipping to the masses with high-end CPUs and quite low-end graphics cards ( selling mainly on the CPU name )...
I was considering a system with a Pentium 4 already in it, and I was thinking about the integrated graphics chip you could pair it with to support DX9+, for example... and as far as T&L is concerned, I saw that an optimized implementation could run decently on the host CPU, saving quite a bit of silicon area on the GPU and allowing the GPU to be clocked higher to increase its fill-rate...
And while your approach is sorting polygons to do deferred transforms, it's not transforming. And while it is transforming, it is not running game code. You're clogging the CPU with lots of data shuffling tasks and floating-point calculations which will make a high-end system perform outright badly when all things are considered.
HT... buzzword or useful feature if we have several tasks that want to run at the same time and we want as little pain from context switching as possible...
ANSWER: BEEEP-> useful feature
The SSE/SSE2 units would run T&L, and I think that sorting the vertex stream is something the Pentium 4 could do quite fast ( 3+ GHz gives you a lot of cycles to spend )
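To put a rough number on "a lot of cycles", here is a back-of-the-envelope budget. Only the 3 GHz clock comes from the discussion; the frame rate, the vertex count and the share of the CPU given to T&L are figures I am assuming purely for illustration, and `cycles_per_vertex` is just a helper name made up for this sketch.

```python
# Back-of-the-envelope cycle budget for running T&L on the host CPU.
# Only the 3 GHz clock comes from the post; every other figure is assumed.

def cycles_per_vertex(clock_hz, frames_per_second, vertices_per_frame, tnl_share):
    """Clock cycles available per vertex when `tnl_share` of each frame goes to T&L."""
    cycles_per_frame = clock_hz / frames_per_second
    return cycles_per_frame * tnl_share / vertices_per_frame

budget = cycles_per_vertex(
    clock_hz=3_000_000_000,      # 3 GHz Pentium 4
    frames_per_second=60,        # assumed target frame rate
    vertices_per_frame=500_000,  # assumed scene complexity
    tnl_share=0.5,               # assumed: half the CPU for T&L, half for game code
)
print(budget)  # 50.0 cycles per vertex
```

Whether 50 cycles per vertex is enough depends on how heavy the vertex shader is and on how well the memory system keeps up, so take it as an order-of-magnitude figure only.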
The two ALUs and the x87 FPU could be dedicated to work on mostly game code and physics...
trust me it is possible... and if you do not believe me well look at the Dreamcast...
PVR GPU with no T&L engine + SH-4 processor with SSE type Vector Unit ( of course it is not so similar, but it is built on top of the RISC FPU and the two cannot coexist at the same exact time... )... that ended up working ok... the SH-4 was clocked at 200 MHz and had PC100 SDRAM as main RAM basically...
Deferred transforming was meant, in this example, to help reduce the effective T&L load... the control points of the HOS are far fewer than the vertices of those surfaces when fully tessellated... if we can eliminate the hidden surfaces, or portions of them, we will reduce the number of surfaces to tessellate ( less load on the CPU as far as tessellation is concerned ) and the number of triangles we have to run complex vertex shaders on.
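A toy count shows where the saving comes from. This is only a sketch under my own assumptions: bicubic patches with 16 control points each, an 8 x 8 tessellation grid, and a `hidden` flag standing in for a real visibility test, which would be far more involved. `Patch`, `tessellated_vertices` and `vertex_load` are hypothetical names.

```python
# Toy count of the vertex load with and without deferred T&L on
# higher-order surfaces (HOS). Patch type, grid size and the `hidden`
# flag are assumptions made for illustration only.
from dataclasses import dataclass

CONTROL_POINTS_PER_PATCH = 16  # assumed bicubic patch: 4 x 4 control points

@dataclass
class Patch:
    name: str
    hidden: bool  # stand-in for the outcome of a real visibility test

def tessellated_vertices(grid):
    """Vertices produced by tessellating one patch into grid x grid quads."""
    return (grid + 1) ** 2

def vertex_load(patches, grid, deferred):
    """Vertices to tessellate and shade; deferred mode skips hidden patches."""
    kept = [p for p in patches if not (deferred and p.hidden)]
    return len(kept) * tessellated_vertices(grid)

patches = [
    Patch("front", hidden=False),
    Patch("side", hidden=False),
    Patch("back", hidden=True),
    Patch("inside", hidden=True),
]

immediate_load = vertex_load(patches, grid=8, deferred=False)   # 4 patches * 81 vertices
deferred_load = vertex_load(patches, grid=8, deferred=True)     # 2 patches * 81 vertices
control_points_tested = len(patches) * CONTROL_POINTS_PER_PATCH  # work done before culling
print(immediate_load, deferred_load, control_points_tested)
```

With these made-up numbers the visibility pass touches 64 control points and halves the tessellation and vertex shader work from 324 to 162 vertices; the real ratio depends entirely on how much of the scene is hidden.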
And BTW, if we do deferred T&L we also eliminate the need for a complex occlusion detection mechanism in our GPU, simplifying its design even further ( clock speed going up ) and saving more money on the GPU.