Originally posted by SA:
Highly scalable problems such as 3d graphics and physical simulation should get near linear improvement in performance with transistor count as well as frequency. As chips specialized for these problems become denser they should increase in performance much more than CPUs for the same silicon process. This means moving as much performance sensitive processing as possible from the CPU to special purpose chips. This is quite a separate reason for special purpose chips than simply to implement the function directly in hardware so as to be able to apply more transistors to the computations (as mentioned in my previous post) It also applies to functions that require general programmability (but are highly scalable). General programmability does not preclude linear scalablity with transistor count. You just need to focus the programmability to handle problems that are linearly scalable (such as 3d graphics and physical simulation). In makes sense of course to implement as many heavily used low level functions as possible directly in hardware to apply as many transistors as possible to the problem at hand.
The other major benefit from using special purpose chips for highly scalable, computation intensive tasks is the simplification and linear scalablity of using multiple chips. This becomes especially true as EDRAM arrives.
The MAXX architecture requires scaling the external memory with the number of chips, so does the scan line (band line) interleave approach that 3dfx used. With memory being such a major cost of a board, and with all those pins and traces to worry about, it is a hard and expensive way to scale chips (requiring large boards and lots of extra power for all that external memory). The MAXX architecture also suffers input latency problems limiting its scalability (you increase input latency by one frame time with each additional chip). The scan line (band line) method also suffers from caching problems and lack of triangle setup scalability (since each chip must set up the same triangles redundantly).
With EDRAM, the amount of external memory needed goes down as the number of 3d chips increase. In fact, with enough EDRAM, the amount of external memory needed quickly goes to 0. EDRAM based 3d chips are thus ideal for multiple chip implementations. You don't need extra external memory as the chips scale (in fact you can get by with less or none), and the memory bandwidth scales automatically with the number of chips.
To make the maximum use of the EDRAM approach, the chips should be assigned to separate rectangular regions or viewports (sort of like very large tiles). The regions do not have their rendering deferred (although they could of course), they are just viewports. This scaling mechanism automatically scales the computation of everything: vertex shading, triangle setup, pixel operations, etc. It does not create any additional input latency, allows unlimited scalablity, and does not require scaling the memory as required by the previously mentioned approaches.
Tilers without EDRAM also scale nicely without needing extra external memory. They are in fact, the easiest architecture to scale across multiple chips. You just assign the tiles to be rendered to separate chips rather than the same chip. The external memory requirements while remaining constant, do not drop however, as they do with EDRAM. The major problem to deal with is scaling the triangle operations as well as the rendering. In this case, combining the multi-chip approach mentioned for EDRAM with tiling solves these issues. You just assign all the tiles in a viewport/region to a particular chip. Everything else is done as above and has the same benefits.
In my mind, the ideal 3d card has 4 sockets and no external memory. You buy the card with one socket populated at the cost of a one chip card. The chip has 32 MB of EDRAM, so with 1 chip you have a 32MB card. When you add a second chip you get a 64MB card with double the memory bandwidth and double the performance. For those who go all out and decide to add 3 chips, they get 128 MB of memory, and quadruple the memory bandwidth and performance. Ideally, the chip uses some form of occlusion culling such as tiling, or hz buffering with early z check, etc. Using the same compatible socket across chip generations would be a nice plus.
In the long run I agree with MFA. Using scene graphs or a similar spatial heirarchy simplifies and solves most of these problems, including accessing, transforming, lighting, shading, and rendering, only what is visible. They also simplify the multiple chip and virtual texture and virtual geometry problems. We will need to wait bit longer for it to appear in the APIs though.
There are indeed two problems generally associated with partitioning the screen across multiple chips. Load balancing, and distributing the triangles to the correct chip. Both have fairly straight forward, very effective solutions, though I can't mention the specifics here.
Those are some good comments, MFA. However, there is no need to defer rendering and no need for a large buffer. Each chip knows which vertices/triangles to process, without waiting.