Skinner said: Well, NV did something with the GF6800 Go: with 12 pipes it almost keeps up with a GF6800U. That could be a sign of things to come as far as efficiency is concerned with the G70/80...
trinibwoy said: Well, if Nvidia could do something about the AA performance hit, that could be substantial too.
IIRC the relative drop with 4xAA (w/out AF) is smaller on NV40 than on R420.
Vertex Fetch 1                     Rasterizer + quad pool 1
              \_ Vertex Shaders _/                         \_ Pixel Shaders - ROPs
              /                  \                         /
Vertex Fetch 2                     Rasterizer + quad pool 2
psurge said: No they aren't. Show me a GPU that can render UT and Doom3 simultaneously, and not by multitasking (preemptive or otherwise).
AFAICS almost everything in current GPUs is based on data parallelism and pipelining. This is not the same as control parallelism - so when I say SMT in reference to a GPU, I mean a GPU capable of handling separate command streams from completely different applications in parallel.
A "multi-core" GPU wouldn't even be more efficient at it, since the most efficient way of doing it would be to never render for more than one application at any one time.
psurge said: Obviously such a system would require either duplication of units (thus no state change overhead), or units which can have 2 "states" in flight simultaneously.
More states in flight require more cache, though, leading back to the same coherency problem. In an architecture like the NV4x, for example, the two states would have to share the limited number of registers.
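The point about two states sharing a limited register file can be put into rough numbers. This is only a back-of-the-envelope sketch: the register file size and the per-pixel register counts below are made-up assumptions for illustration, not actual NV4x figures, and `quads_in_flight` is a hypothetical helper.

```python
# Hypothetical numbers only; these are NOT real NV4x specifications.
REGISTER_FILE_SLOTS = 1024  # assumed total register slots available to the pixel shader
PIXELS_PER_QUAD = 4         # a quad is a 2x2 block of pixels


def quads_in_flight(slots, regs_per_pixel):
    """Number of quads that fit when every pixel needs regs_per_pixel registers."""
    return slots // (regs_per_pixel * PIXELS_PER_QUAD)


# One state owning the whole register file.
alone = quads_in_flight(REGISTER_FILE_SLOTS, regs_per_pixel=4)

# Two states splitting the same register file evenly.
half = REGISTER_FILE_SLOTS // 2
shared_a = quads_in_flight(half, regs_per_pixel=4)
shared_b = quads_in_flight(half, regs_per_pixel=8)

print(alone, shared_a, shared_b)
```

Under these assumed numbers each state gets at most half as many quads in flight as it would alone, which would generally leave less work available to hide latency behind.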
psurge said: Well, the NV4x works on batches of quads, so the quad pools would either have to be duplicated (register waste, as you say), or a single pool would have to be able to contain quads from different programs. Basically the pixel shader control unit would have to be able to deal with quads that have different program counters, and dispatch from 2 programs (or program locations) simultaneously. This sounds like something you have to do anyway for efficient branching, and to avoid stalling the pixel shader if you just don't have hundreds of pixels to run the same shader on.
Sure, but what if each program wants to do branching?
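The scheme described in the quoted post (a single pool holding quads from different programs, each quad with its own program counter, and a control unit that dispatches from two program locations at once) can be sketched as a toy model. Everything here is an assumption made for illustration: the `Quad` type, the program lengths, the "largest group first" policy and the two dispatch slots do not describe how any real GPU schedules work, and branching is left out entirely.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Quad:
    quad_id: int
    program: str  # which shader program this quad belongs to
    pc: int = 0   # per-quad program counter


# Hypothetical straight-line programs, given only by their instruction count.
PROGRAM_LENGTH = {"A": 3, "B": 2}
# Assumed: the control unit can issue from 2 (program, pc) locations per cycle.
DISPATCH_SLOTS = 2


def step(pool):
    """One cycle: group unfinished quads by (program, pc), advance the largest groups."""
    groups = defaultdict(list)
    for quad in pool:
        if quad.pc < PROGRAM_LENGTH[quad.program]:
            groups[(quad.program, quad.pc)].append(quad)
    # Prefer the groups with the most quads; break ties by (program, pc).
    chosen = sorted(groups, key=lambda k: (-len(groups[k]), k))[:DISPATCH_SLOTS]
    for key in chosen:
        for quad in groups[key]:
            quad.pc += 1
    return chosen


def run(pool):
    """Step until every quad has finished its program; return the cycle count."""
    cycles = 0
    while any(q.pc < PROGRAM_LENGTH[q.program] for q in pool):
        step(pool)
        cycles += 1
    return cycles


# Quads from two programs, one of them already one instruction ahead.
pool = [Quad(0, "A"), Quad(1, "A"), Quad(2, "B"), Quad(3, "A", pc=1)]
print(run(pool))
```

Setting `DISPATCH_SLOTS` to 1 in this toy model gives the case where only one program location can be issued per cycle, which makes it easy to compare cycle counts for a mixed pool. Once programs branch, quads of the same program would scatter across more program counters, so the groups would shrink, which is the concern raised in the reply.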