ATTILA is a GPU simulator, not a hardware design. They don't have to route data around and worry about various HW limitations.
Duh. However, ATTILA is, AFAIK, what influenced Intel to implement triangle setup in the shader core of their current SM3/SM4 IGP architecture. The only reason I mentioned it, anyway, is that it's the only source I've ever seen for an arithmetic, shader-like 'triangle setup' program.
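For reference, here's roughly what I mean by an arithmetic setup program: a minimal C sketch of the edge-equation half of triangle setup. All names and structure here are mine, purely to show the flavour of it; this is not ATTILA's actual code.

[code]
/* Minimal sketch of triangle setup as plain arithmetic, in the spirit of
 * the ATTILA-style setup program mentioned above. Illustrative only.
 * Computes the three edge equations E(x,y) = a*x + b*y + c that the
 * rasterizer then evaluates per pixel/tile. */
typedef struct { float x, y; } Vec2;
typedef struct { float a, b, c; } Edge;

static Edge edge_equation(Vec2 p, Vec2 q)
{
    /* Line through p and q; E(x,y) >= 0 on one side of the edge. */
    Edge e = { p.y - q.y, q.x - p.x, p.x * q.y - p.y * q.x };
    return e;
}

void triangle_setup(Vec2 v0, Vec2 v1, Vec2 v2, Edge edges[3])
{
    edges[0] = edge_equation(v0, v1);
    edges[1] = edge_equation(v1, v2);
    edges[2] = edge_equation(v2, v0);
    /* A full setup program also handles winding, computes 1/(2*area) for
     * attribute interpolation, culls degenerates, etc. -- but the core is
     * just MULs, ADDs and a reciprocal: very shader-friendly math. */
}
[/code]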
It doesn't matter, because the data flow doesn't fit the framework of the shaders.
I am rather skeptical that it doesn't at least loosely fit the framework of the geometry shader. Of course, the geometry shader's peak rate is rather... unimpressive right now, but you can't get away from the fact that it will have to get progressively faster in the future. That same hardware could be partly shared for the data flow & synchronization of triangle setup (in fact, the two could possibly be merged into the same program).
You already have programmability with point sampling and/or fetch4. Going beyond that is rather pointless.
Getting full programmability for free would be absurd, yes. However, point sampling and/or fetch4 is so far from optimal it's pretty funny. Okay, let's look at this another way: what do you need to do texture filtering in the shader core?
- The texture colors and the weights for each bilinear operation, for every colour channel.
- The number of bilinear operations to perform and the n-1 weights between them.
For an INT8 texture, the colors & weights are likely all being transmitted in 16-bit format (and converted by a 'free' converter somewhere). Note that I am assuming this (rather than FP32 for everything) in order to be pessimistic for my own estimates. Therefore, it should be possible with proper packing (ugh, I know) to transmit the colors in 2 cycles and the weights in 1 cycle. So 3 cycles total for bilinear, 6 cycles for trilinear. This isn't fundamentally different from how the same paths are reused for a 4xINT8 texture or a 2xFP16 texture IMO...
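To make the ALU side of it concrete, the per-channel math once the operands have arrived is just three lerps (i.e. MAD-style ops) for bilinear, plus one more lerp between mip levels for trilinear. A hedged sketch, assuming the 16-bit-widened-to-float delivery described above; names are purely illustrative:

[code]
/* Per-channel filtering math once operands have been delivered. */
static float lerp(float a, float b, float w)
{
    return a + w * (b - a);            /* one MAD-style operation */
}

/* Bilinear: 4 texel colors, 2 fractional weights -> 3 lerps. */
static float bilinear(float c00, float c10, float c01, float c11,
                      float wu, float wv)
{
    return lerp(lerp(c00, c10, wu), lerp(c01, c11, wu), wv);
}

/* Trilinear: n = 2 bilinear operations plus the n-1 = 1 weight between
 * them, matching the operand list above. */
static float trilinear(const float lo[4], const float hi[4],
                       float wu, float wv, float wmip)
{
    return lerp(bilinear(lo[0], lo[1], lo[2], lo[3], wu, wv),
                bilinear(hi[0], hi[1], hi[2], hi[3], wu, wv), wmip);
}
[/code]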
I'm not saying no new scheduling or routing logic would be required. Duh, of course it would. However, I think you are overestimating its size, and underestimating the logic it could save by always running certain filtering modes in the shader core. And much more importantly, I think history will prove you wrong...
Arun, you need to get a handle on the size of arithmetic logic. Remember that R420, at 160M transistors, has the same setup rate, HiZ rejection rate, and scanline rasterization rate as RV770. You don't need full FP32 arithmetic for most of the operations. You don't need access to a shader program, or to pick values from 64K of register space.
I think I have a much better handle on that than you think I do...
Furthermore, you strangely put HiZ and scanline rasterization into the equation; yes, those are things that are also hard to parallelize, but unlike triangle setup their computational requirements are massively different from the shader core. It's all about a large number of very small operations; and obviously doing that in 'software' would be pure madness: there's nothing to gain there.
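To illustrate what I mean by 'a large number of very small operations': per screen tile, HiZ rejection boils down to something like a single low-precision compare. My own illustrative sketch, not any IHV's implementation:

[code]
#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch only: HiZ keeps one coarse max-depth value per tile;
 * rejecting a triangle for that tile is a single low-precision compare.
 * Replicating thousands of these tiny compares per clock in dedicated
 * logic is cheap; burning FP32 ALUs on them would be pure madness. */
static bool hiz_reject_tile(uint16_t tile_max_depth,
                            uint16_t tri_min_depth)
{
    /* With a less-than depth test, the triangle cannot pass anywhere in
     * the tile if its nearest depth is behind the tile's farthest
     * stored depth. */
    return tri_min_depth >= tile_max_depth;
}
[/code]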
On the other hand, while triangle setup is indeed hard to parallelize, the operations are much higher precision (in fact, the best argument I've heard so far for not doing it in the shader core is that FP32 just isn't enough! Although I do wonder if that doesn't depend on the algorithm, given that Intel seems to be doing it just fine...) and it looks much more like a classic shader program. So yes, it's hard to do, but the reward may very well be worth the initial R&D cost and the slight die size cost.
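As an aside on the FP32 point, here's a toy example (my own, not from anyone's actual setup pipeline) of where single precision bites: the 2D cross products that setup relies on cancel badly once coordinates get large, as with big render targets plus guard bands:

[code]
#include <stdio.h>

/* Toy demonstration of FP32 cancellation in the kind of cross product
 * triangle setup relies on. With large screen/guard-band coordinates the
 * two products are huge and nearly equal, so their small difference
 * drowns in rounding error. */
int main(void)
{
    float  px = 12345.6f, py = 7890.1f;
    float  qx = 24691.2f, qy = 15780.201f;  /* q is almost exactly 2*p */

    float  cross_f = px * qy - py * qx;
    double cross_d = (double)px * qy - (double)py * qx;

    /* The FP32 result can differ substantially from the wider-precision
     * reference on the same inputs, even flipping sign -- exactly the
     * kind of case that breaks watertight rasterization. */
    printf("FP32: %g   FP64 (reference): %g\n", cross_f, cross_d);
    return 0;
}
[/code]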
My point is really this: it will be necessary in the future to be able to parallelize it; it's just not fast enough otherwise for a huge 500mm² 40nm chip, and you can't get around that. And once you have found an acceptable way to parallelize it, simply adding more fixed-function hardware for the job would be very expensive and inefficient, because (unlike HiZ or rasterization) those aren't just small and cheap operations at all. So like it or not, I don't see how you can get around having to do this within the next 2 years...
Die cost isn't the issue here. It's just a matter of tiptoeing through the minefield of parallelizing a part of the graphics pipeline that has always been serial. There are so many edge cases that cause incorrect output if you aren't perfect.
Where did I say it was easy? However, creating a highly efficient unified shader core isn't easy either, nor is a good redundancy mechanism. That doesn't mean you should give yourself the luxury of avoiding them until you're really, really forced to; otherwise you'll just go the way of 3dfx.
I would be very surprised if neither IHV had considered it very seriously for the DX11 generation anyway, given this:
http://www.gamedev.net/columns/events/gdc2006/article.asp?id=233
I could definitely still be wrong here, and if I ever change my opinion I'll gladly admit so, but I certainly don't think the arguments are anywhere near as clear-cut as you make them out to be; of course, I'll also gladly admit that they're not anywhere near as clear-cut as *I* originally made them out to be!