NV40 architecture and multichip GPUs

nAo · Apr 3, 2005

Sorry if this was already posted, but Nvidia published an interesting recap of the NV40 architecture on their developer site. It's an extract from the GPU Gems 2 book:
GPU Gems 2, The GeForce 6 Series GPU Architecture (Chapter 30)

I believe most of the facts you can read there are well known by forum regulars but it's a good read nonetheless.
About performance:

● 425 MHz internal graphics clock
● 550 MHz memory clock
● 600 million vertices/second
● 6.4 billion texels/second
● 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)
● 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction operation (a complex math operation, such as a sine or reciprocal square root)
● 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32 multiplies per clock cycle
● 64 pixels per clock cycle early z-cull (reject rate)

Since I love inflated gigaflop/s figures it's nice to note that there are about 80 gigaflop/s computation power (without counting 16fp normalization..) just in the fragment processors and other 20+ gigaflops/s in the vertex shader processor.
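For anyone who wants to check where those figures come from, here's a quick back-of-the-envelope sketch in Python. It assumes the usual convention that a MAD counts as 2 flops per component and a multiply or scalar op as 1, it reads the list literally (one scalar multifunction op per clock in the vertex shader), and it uses the 425 MHz clock quoted above. The function name is just for illustration.

```python
# Peak-flops estimate from the per-clock figures quoted above.
# Assumption: a MAD counts as 2 flops per component, a multiply or scalar op as 1.

CLOCK_HZ = 425e6  # internal graphics clock from the list above


def peak_gflops(flops_per_clock, clock_hz=CLOCK_HZ):
    """Convert flops per clock cycle into GFLOP/s at the given clock."""
    return flops_per_clock * clock_hz / 1e9


# Fragment processors: 16 four-wide MADs plus 16 four-wide multiplies per clock.
fragment_flops_per_clock = 16 * 4 * 2 + 16 * 4 * 1

# Vertex shader: 6 four-wide MADs plus one scalar multifunction op per clock
# (taking the list literally; this is an assumption about how to count it).
vertex_flops_per_clock = 6 * 4 * 2 + 1

fragment_gflops = peak_gflops(fragment_flops_per_clock)
vertex_gflops = peak_gflops(vertex_flops_per_clock)

print(f"fragment: {fragment_gflops:.1f} GFLOP/s")
print(f"vertex:   {vertex_gflops:.1f} GFLOP/s")
```

That works out to 192 flops per clock in the fragment processors and 49 in the vertex shader, which is where the "about 80" and "20+" come from.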
It also sheds some light on the temporary register issues:

Excessive internal storage requirements can adversely affect performance in the following way: The shader pipeline is optimized to keep hundreds of fragments in flight given a fixed amount of register space per fragment (four fp32×4 registers or eight fp16×4 registers). If the register space is exceeded, then fewer fragments can remain in flight, reducing the latency tolerance for texture fetches, and adversely affecting performance.

Similarly, the register file has enough read and write bandwidth to keep all the units busy if reading fp16×4 values, but it may run out of bandwidth to feed all units if using fp32×4 values exclusively.
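
A rough way to picture the first point, assuming the register file is a fixed pool that is simply divided among the fragments in flight (that's my simplification, the chapter doesn't spell it out in these terms). The baseline of 256 fragments is a made-up number for illustration; only the budget of four fp32×4 or eight fp16×4 registers comes from the quote.

```python
# Toy model: a fixed register pool shared by all in-flight fragments.
# Assumption: once a shader exceeds the per-fragment budget, the number of
# fragments in flight shrinks in proportion to the extra register space used.

BUDGET_BYTES = 4 * 16      # four fp32x4 registers = 64 bytes per fragment
BASELINE_FRAGMENTS = 256   # hypothetical count when the shader is within budget


def bytes_used(fp32_regs, fp16_regs):
    """Register bytes one fragment needs (fp32x4 = 16 bytes, fp16x4 = 8 bytes)."""
    return fp32_regs * 16 + fp16_regs * 8


def fragments_in_flight(fp32_regs, fp16_regs):
    """Estimated fragments in flight for a shader's temporary register usage."""
    used = bytes_used(fp32_regs, fp16_regs)
    if used <= BUDGET_BYTES:
        return BASELINE_FRAGMENTS
    pool = BUDGET_BYTES * BASELINE_FRAGMENTS  # total register space, fixed
    return pool // used


for regs in [(4, 0), (0, 8), (8, 0), (6, 4)]:
    print(regs, fragments_in_flight(*regs))
```

In this model, doubling the register footprint halves the fragments in flight, which is exactly the loss of texture latency tolerance the quote is warning about.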

It's also interesting to note that one of the authors of this document is Mr. Emmett Kilgariff, who used to work at 3Dfx, IIRC.
Dave Baumann wrote in this thread

Well, given the guy that designed large parts of NV40's shader core and presumably led the design was also the guy that headed up the Rampage development at 3dfx, those types of influences in the company must be fairly large.

Is Mr. Kilgariff the ex-3Dfx guy who designed large parts of NV40's shader core?
As you can read in that thread, Dave was talking about some rumour/noise indicating Nvidia could go multichip (with different ICs for different tasks) sometime in the future.
Actually, I also heard something very vague about this regarding G70 or G80.
Even if it's just a remote rumour (and it could be blatantly false), what would be the main advantages/disadvantages of having different ICs for different tasks?

ciao,
Marco

RejZoR · Apr 3, 2005

I was thinking long ago about graphics cards with dedicated chips for specific tasks. Like one chip only for FSAA jobs (highly optimized for doing this), another one for, I don't know, pixel shaders/vertex shaders, another one for texture filtering and so on. Don't know if it can be built that way or whether you can do this in one chip, but the idea itself is cool hehe

Ailuros · Apr 3, 2005

Emmett Kilgariff used to be the lead engineer at 3dfx if my memory doesn't betray me.

From the chapter29 pdf for the GPUGems2 book:

Chapter 30, “The GeForce 6 Series GPU Architecture,” by Emmett Kilgariff and Randima Fernando of NVIDIA, describes in detail the design of a current state-of-the-art graphics processor, the GeForce 6800. Cowritten by one of the lead architects of the chip, this chapter includes many low-level details of the hardware that are not available anywhere else. This information is invaluable for anyone writing high-performance GPU applications.

As for your question I don't know, yet NVIDIA AFAIK was supplying Quantum3D with boards, since they started co-operating (I think it was in 2001?), that ended up in multi-board configs like in the Independence Systems.

I even think that some former Rampage engineers made it over to Quantum3D.

Geo · Apr 3, 2005

Ailuros said:
Emmett Kilgariff used to be the lead engineer at 3dfx if my memory doesn't betray me.

Emmett Kilgariff (Former VP of Engineering, 3dfx Rampage, now NVIDIA Architecture Manager in charge of NV40's texture and shader core) with David Kirk and Tony Tamasi

http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=9

Blazkowicz · Apr 3, 2005

RejZoR said:
I was thinking long ago about graphics cards with dedicated chips for specific tasks. Like one chip only for FSAA jobs (highly optimized for doing this), another one for, I don't know, pixel shaders/vertex shaders, another one for texture filtering and so on. Don't know if it can be built that way or whether you can do this in one chip, but the idea itself is cool hehe

Maybe they'll go the route of the Rampage + SAGE combination (as seen on the Wildcat Realizm 800 too)? That was rumored long ago.

One chip for vertex shader/geometry stuff, and one or two chips in SLI for pixel shading and other pixel stuff. That way the PS3 ends up with a pixel only GPU, SLI is more well done (esp. if they use tiling as on R300 and realizm?), and that explains nvidia not willing to do physical unification of vertex and pixel ALUs.

(wild guess, I'm not The Inqu, I don't have "insider information")

Joe DeFuria · Apr 3, 2005

Blazkowicz_ said:
One chip for vertex shader/geometry stuff, and one or two chips in SLI for pixel shading and other pixel stuff. That way the PS3 ends up with a pixel only GPU, SLI is more well done (esp. if they use tiling as on R300 and realizm?), and that explains nvidia not willing to do physical unification of vertex and pixel ALUs.

It would also give nVidia a different way to address consumer vs. "professional" boards.

If such an architecture was constructed so they could combine a relatively arbitrary number of vertex processor chips vs. pixel processor chips, they could create board level designs such as the following...assuming professional boards tend to focus more on vertex power than pixel power:

1) 3 Vertex processors + 1 Pixel processor: Professional board
2) 1 Vertex processor + 2 Pixel processors: Consumer board

Etc.

Almost the exact opposite approach of where ATI would appear to be going (based on our speculation), which might be something like:

1) 1 to 2 identical chips: consumer board
2) 2+ identical chips: professional board

(ATIs boards would dynamically allocate vertex / pixel power as needed.)

Both approaches would have different pros and cons. Would be most interesting if this actually turns out to be the path they are both going down.

_xxx_ · Apr 3, 2005

nAo said:
...what would be the main advantages/disadvantages of having different ICs for different tasks?

Advantages are pretty clear - 2 or 3 smaller chips would be much cheaper to produce and deliver much better yields than one large chip. It also enables them to combine those chips for anything between low-end (fewer chips) and high-end for HC-gamers with more chips. A package like this will also be much easier to keep cool and thus enable higher clocks.

Disadvantages are more complicated board design, bigger boards, fixed number of pipelines (less flexible than the "adaptive" version from the competition, could be either good or bad though). Probably more problems bringing this to laptops and lowest-end cards.

Both ways it could be good or bad, we'll see it when it arrives
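
To put a rough number on the yield argument, here's a sketch using the simple Poisson yield model, yield = exp(-area × defect density). The defect density and die areas are made-up illustrative values, not figures for any real process or chip, and the model ignores packaging, pad area and the links between chips, so treat it as the optimistic case.

```python
import math

# Simple Poisson yield model: yield = exp(-die_area * defect_density).
# Assumptions: defect density and die areas are hypothetical, and packaging,
# pad area and inter-chip links are ignored entirely.

DEFECTS_PER_CM2 = 0.5  # hypothetical defect density


def die_yield(area_cm2, defects_per_cm2=DEFECTS_PER_CM2):
    """Fraction of dies of the given area expected to be defect-free."""
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_product(area_cm2, chips_per_product):
    """Wafer area (cm^2) spent per working product built from identical dies."""
    return chips_per_product * area_cm2 / die_yield(area_cm2)


one_big = silicon_per_good_product(3.0, 1)      # one 3 cm^2 die
three_small = silicon_per_good_product(1.0, 3)  # three 1 cm^2 dies

print(f"one large die:    {one_big:.2f} cm^2 per good product")
print(f"three small dies: {three_small:.2f} cm^2 per good product")
```

With these numbers the three small dies need well under half the wafer area per working product, before any of the multichip overheads are counted.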

Geo · Apr 3, 2005

My, my, the worm turns. Not very long ago at all such suggestions around here would either have the poster pilloried as a quack, or the natives fleeing in terror from the 3dfx-bot.

nutball · Apr 3, 2005

I can't help but wonder whether the sort of proximity communication tech that Sun (maybe others) have been trumpeting would be useful in such a multi-(chip|core|die) solution.

This:

http://research.sun.com/sunlabsday/docs/talks/1.02_Drost.pdf

I guess it's a way off (a few years?) but might help to reduce the complexity of the boards.

ondaedg · Apr 3, 2005

I would like either IHV to incorporate a chip dedicated to AA. I don't know if it is possible, but I am willing to spend a few extra bucks for it.

Tahir2 · Apr 3, 2005

ondaedg said:
I would like either IHV to incorporate a chip dedicated to AA.

Large amounts of the transistor count for both the NV4x and R42x are dedicated to AA - your wish has already been realised.

overclocked · Apr 3, 2005

Large amounts of the transistor count for both the NV4x and R42x are dedicated to AA - your wish has already been realised.

Yes, that logic takes up a lot more relative to, say, the fragment pipes.

nAo · Apr 3, 2005

I wonder what kind of bus could be used to connect a multichip configuration.
HyperTransport? FlexIO?
There's a good chance that Nvidia's PS3 GPU will connect to CELL via a FlexIO interface.
Maybe Nvidia is going to re-use that technology in some PC part. OK, I'm just speculating too much here.

ondaedg · Apr 4, 2005

Tahir said:
ondaedg said:

I would like either IHV to incorporate a chip dedicated to AA.

Large amounts of the transistor count for both the NV4x and R42x are dedicated to AA - your wish has already been realised.

This has already been noted. I can't help but wonder though what a chip dedicated to just AA could do. 4x SS at 4x MS speeds would be a nice result.

Tahir2 · Apr 4, 2005

Well good AA (better than we have now) relies on massive bandwidth and Video RAM space. You can't add a "chip" and expect faster AA - it goes with the whole architecture.
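
As a rough illustration of the memory side, here's the uncompressed framebuffer footprint as the sample count goes up. The assumptions are mine: 4 bytes of color and 4 bytes of depth/stencil per sample, no compression, and a 1600x1200 render target picked as an example.

```python
# Uncompressed framebuffer footprint with multisample/supersample AA.
# Assumptions: 4 bytes of color and 4 bytes of depth/stencil per sample,
# no compression, and a hypothetical 1600x1200 render target.


def framebuffer_mib(width, height, samples, color_bytes=4, depth_bytes=4):
    """Color plus depth/stencil storage in MiB for the given sample count."""
    total = width * height * samples * (color_bytes + depth_bytes)
    return total / (1024 * 1024)


for samples in (1, 4, 8):
    print(f"{samples}x: {framebuffer_mib(1600, 1200, samples):.1f} MiB")
```

Storage grows linearly with the sample count, and every one of those samples also has to be read and written, so the bandwidth cost scales with it.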

trinibwoy · Apr 4, 2005

Although it's a long shot, it would be a nice twist if Nvidia did something with XDR in the PC space. Might provide enough raw bandwidth for 8xAA.

Simon F · Apr 4, 2005

_xxx_ said:
Advantages are pretty clear - 2 or 3 smaller chips would be much cheaper to produce and deliver much better yields than one large chip. .

Something you haven't factored in is that separate chips need to be connected by some kind of (probably very wide) data bus. That means a lot of area dedicated to the pads and to drive the external logic signals which also requires a lot more power.

_xxx_ · Apr 4, 2005

Simon F said:
_xxx_ said:

Advantages are pretty clear - 2 or 3 smaller chips would be much cheaper to produce and deliver much better yields than one large chip. .

Something you haven't factored in is that separate chips need to be connected by some kind of (probably very wide) data bus. That means a lot of area dedicated to the pads and to drive the external logic signals which also requires a lot more power.

I have. Later in my post I said it leads to more complicated board design etc.

mboeller · Apr 4, 2005

_xxx_ said:
Advantages are pretty clear - 2 or 3 smaller chips would be much cheaper to produce and deliver much better yields than one large chip. It also enables them to combine those chips for anything between low-end (fewer chips) and high-end for HC-gamers with more chips. A package like this will also be much easier to keep cool and thus enable higher clocks.

Disadvantages are more complicated board design, bigger boards, fixed number of pipelines (less flexible than the "adaptive" version from the competition, could be either good or bad though). Probably more problems bringing this to laptops and lowest-end cards.

Both ways it could be good or bad, we'll see it when it arrives

I would expect some form of MCM which can have from 1-4 (maybe 8) processing elements (ALU, texturing) and maybe 1-4 DRAM or SRAM chips with a very high on-MCM bandwidth. So IMHO the board design will look the same as for graphics chips now.

London Geezer · Apr 4, 2005

_xxx_ said:
Simon F said:

_xxx_ said:

Advantages are pretty clear - 2 or 3 smaller chips would be much cheaper to produce and deliver much better yields than one large chip. .

Something you haven't factored in is that separate chips need to be connected by some kind of (probably very wide) data bus. That means a lot of area dedicated to the pads and to drive the external logic signals which also requires a lot more power.

I have. Later in my post I said it leads to more complicated board design etc.

Welcome to the Understatement of the Year Awards 2005. This year's Nominees are....
