Sorry if this was already posted, but Nvidia has published an interesting recap of the NV40 architecture on their developer site. It's an extract from the GPU Gems 2 book:
GPU Gems 2, The GeForce 6 Series GPU Architecture (Chapter 30)
I believe most of the facts you can read there are well known by forum regulars but it's a good read nonetheless.
About performance:
● 425 MHz internal graphics clock
● 550 MHz memory clock
● 600 million vertices/second
● 6.4 billion texels/second
● 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)
● 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction operation (a complex math operation, such as a sine or reciprocal square root)
● 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32 multiplies per clock cycle
● 64 pixels per clock cycle early z-cull (reject rate)
Since I love inflated gigaflop/s figures, it's nice to note that there are about 80 gigaflop/s of computing power (without counting fp16 normalization) in the fragment processors alone, and another 20+ gigaflop/s in the vertex shader processors.
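For anyone who wants to check the gigaflop/s figures, here is a rough back-of-the-envelope sketch in Python. The per-clock unit counts and the 425 MHz clock come from the list above; the counting convention (a MAD is 2 flops, a multiply or a scalar multifunction op is 1 flop) and the helper name `gflops` are my own assumptions, not something the chapter states.

```python
# Peak-throughput estimate from the per-clock figures quoted above.
# Assumed counting convention (not from the chapter): MAD = 2 flops,
# multiply = 1 flop, scalar multifunction op = 1 flop.

CLOCK_HZ = 425e6  # internal graphics clock from the quoted list


def gflops(mads, muls=0, scalar_ops=0, width=4, clock_hz=CLOCK_HZ):
    """Return peak GFLOP/s for a block of `width`-wide vector units."""
    flops_per_clock = mads * width * 2 + muls * width + scalar_ops
    return flops_per_clock * clock_hz / 1e9


# Fragment processor: 16 four-wide MADs plus 16 four-wide multiplies per clock.
fragment_gflops = gflops(mads=16, muls=16)

# Vertex shader: 6 four-wide MADs plus one scalar multifunction op per clock.
vertex_gflops = gflops(mads=6, scalar_ops=1)

print(f"fragment: {fragment_gflops:.1f} GFLOP/s, vertex: {vertex_gflops:.1f} GFLOP/s")
```

That works out to 192 flops per clock in the fragment processors and 49 per clock in the vertex shader, which is where "about 80" and "20+" come from.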
It sheds some light on the temporary register issues:
Excessive internal storage requirements can adversely affect performance in the following
way: The shader pipeline is optimized to keep hundreds of fragments in flight given
a fixed amount of register space per fragment (four fp32×4 registers or eight fp16×4
registers). If the register space is exceeded, then fewer fragments can remain in flight,
reducing the latency tolerance for texture fetches, and adversely affecting performance.
Similarly, the register file has enough read and write bandwidth to keep all the units
busy if reading fp16×4 values, but it may run out of bandwidth to feed all units if
using fp32×4 values exclusively.
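To make the register-pressure point concrete, here is a toy model in Python. Only the per-fragment budget (four fp32×4 or eight fp16×4 registers) comes from the quote. The baseline of 256 fragments in flight is a made-up stand-in for "hundreds", the linear scaling is my simplification, and `fragments_in_flight` is a hypothetical helper, so treat the numbers as illustrative only.

```python
# Toy model: how many fragments stay in flight as a shader uses more registers.
# From the quote: the budget per fragment is 4 fp32x4 or 8 fp16x4 registers.
# Assumed (not from the chapter): a 256-fragment baseline and linear scaling,
# ignoring whatever allocation granularity the real hardware has.

BUDGET_FP16_SLOTS = 8  # 8 fp16x4 registers == 4 fp32x4 registers


def fragments_in_flight(fp32_regs, fp16_regs=0, baseline_fragments=256):
    """Estimate fragments in flight for a shader's temporary register usage."""
    # One fp32x4 register takes the space of two fp16x4 registers.
    slots = 2 * fp32_regs + fp16_regs
    if slots <= BUDGET_FP16_SLOTS:
        return baseline_fragments  # fits in the fixed per-fragment budget
    # Over budget: the same register file is shared by fewer fragments.
    return baseline_fragments * BUDGET_FP16_SLOTS // slots


for fp32, fp16 in [(4, 0), (0, 8), (8, 0), (2, 8)]:
    print(fp32, "fp32 +", fp16, "fp16 ->", fragments_in_flight(fp32, fp16))
```

In this model, a shader that needs eight fp32×4 temporaries keeps only half as many fragments in flight as one that fits in four, so there is roughly half as much work available to hide texture fetch latency.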
It's also interesting to note that one of the authors of this document is Mr. Emmett Kilgariff, who used to work at 3Dfx, IIRC.
Dave Baumann wrote in this thread:
"Well, given the guy that designed large parts of NV40's shader core and presumably led the design was also the guy that headed up the Rampage development at 3dfx, those types of influences in the company must be fairly large."
Is Mr. Kilgariff the ex-3Dfx guy who designed large parts of NV40's shader core?
As you can read in that thread, Dave was speaking about a rumour indicating that Nvidia could go multichip (with different ICs for different tasks) sometime in the future.
I have also heard something very vague about this regarding G70 or G80.
Even if it's just a remote rumour (and it could be blatantly false), what would be the main advantages and disadvantages of having different ICs for different tasks?
ciao,
Marco