Article on CineFX/NV30 Architecture (German)

Xmas

Some in-depth views of the CineFX architecture can be found over at 3DCenter. I think this might be interesting for some of you. Currently only a German version is available but an English translation is coming soon (maybe tomorrow... maybe ;) )
 
Xmas said:
Some in-depth views of the CineFX architecture can be found over at 3DCenter. I think this might be interesting for some of you. Currently only a German version is available but an English translation is coming soon (maybe tomorrow... maybe ;) )

Interesting? Indeed!
I'm at Page 2 right now ( using babelfish... So I can't really guarantee I understood anything yet ;) ) and there's just one thing that I seriously wonder about:
If EXP, LIT and LOG can be done in parallel with tables, how come none of us have even thought about the fact it could be done in parallel and tried it with a test program? I've never seen any tests...
Or could it be the driver isn't exposing this yet? ( Or that the paper-to-silicon stage failed for that - too )


Uttar

EDIT: BTW, anyone noticed yet that this "is fundamentally a 8 pipelines design"? ;) :LOL:
 
mboeller said:
No;

fundamentally it's a 1-pipeline design, able to output 4 pixels/cycle due to the SIMD architecture.

Hmm, I realized that.
That was *sarcasm* ;)
I was just joking at nVidia insisting on the "8 pipelines" thing even though the design is fully based around the very idea of 4 pixels in 1 pipeline.
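
To make the "4 pixels in 1 pipeline" idea concrete, here is a rough sketch of one instruction stream driving a whole quad in lockstep. The tiny two-op "ISA", the register names and the `run_quad` helper are all made up for illustration, nothing here is taken from the article or the patent.

```python
# Illustrative only: one instruction stream applied to a 2x2 quad in lockstep.
# The ops, register names and helper are hypothetical, not NV30's real ISA.

def run_quad(program, quad_regs):
    """Execute each instruction once, applying it to all four pixels."""
    issued = 0
    for op, dst, a, b in program:
        issued += 1  # one issue slot covers the whole quad
        for regs in quad_regs:  # the four SIMD lanes
            if op == "MUL":
                regs[dst] = regs[a] * regs[b]
            elif op == "ADD":
                regs[dst] = regs[a] + regs[b]
            else:
                raise ValueError(f"unknown op {op}")
    return issued


program = [("MUL", "r2", "r0", "r1"), ("ADD", "r2", "r2", "r0")]
quad = [{"r0": float(i), "r1": 2.0} for i in range(4)]
issued = run_quad(program, quad)
```

Two instructions are issued, yet four pixels come out the other end, which is the sense in which a single pipeline can still "output 4 pixels/cycle".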

As for the zixel trick, it's obvious that it's a bypass path also working on quads, but does not do any shading or texturing in it.

After reading the whole article, regarding the NV40:
The NV40 is considered an "8 pipelines design", so the first question is:
8x1 pixels, 2x4 pixels ( 2 quads ) or 1x8 pixels.

Looking at the goal of the NV50 being a full ILDP, you may suppose here that more things will slowly move towards that ideal. 2 quads would make sense here, with the functional units of both quads sharing their stuff, and working a few instructions apart to allow more parallelism.
So for example if the RSQ units of one pipeline are idle, and you've got to do two RSQs in a row in the other, you'd have no performance penalty - or you could have slightly different units in both paths and use the first path's units for all RSQ work.
Eventually, since it seems the ADD and MUL units are not really united and have a cache between them, you could even do 2 ADDs and 2 MULs in parallel.

Or you could do that in a completely different way, going away from the CineFX architecture, and that might in fact be much more sensible, because this approach seems fairly icky.

What still amuses me is that David Kirk gave massive hints to all that in the Extreme Tech preview at launch with comments from him & Tabasi ( sp? ), but I guess there were too many contradictory reports and he was too vague for us to figure it out :(


Uttar
 
Uttar said:
After reading the whole article, regarding the NV40:
The NV40 is considered an "8 pipelines design", so the first question is:
8x1 pixels, 2x4 pixels ( 2 quads ) or 1x8 pixels.

It will not be that easy to use quads with PS 3.0, because every pixel can execute a different part of the program. It is not impossible, but in the case where each pixel in a quad goes a different way, you have a problem to solve.
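
A minimal sketch of why divergence hurts, assuming the quad stays in lockstep and handles a branch by running every side that at least one pixel needs (masking off the others). The function name and the instruction counts are hypothetical, and real hardware may well solve this differently.

```python
# Sketch under assumptions: a quad runs in lockstep, so a branch side is
# skipped only when no pixel in the quad takes it. Numbers are made up.

def run_branch_on_quad(conds, then_len, else_len):
    """Return the issue slots spent on one if/else for a whole quad.

    conds: four booleans, the branch condition evaluated per pixel.
    """
    cost = 0
    if any(conds):        # at least one pixel needs the "then" side
        cost += then_len
    if not all(conds):    # at least one pixel needs the "else" side
        cost += else_len
    return cost


uniform = run_branch_on_quad([True, True, True, True], then_len=10, else_len=6)
divergent = run_branch_on_quad([True, False, True, True], then_len=10, else_len=6)
```

With these made-up lengths the uniform quad pays for 10 instructions, while a single pixel going the other way makes the whole quad pay for 16.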

Uttar said:
Looking at the goal of the NV50 being a full ILDP, you may suppose here that more things will slowly move towards that ideal. 2 quads would make sense here, with the functional units of both quads sharing their stuff, and working a few instructions apart to allow more parallelism.
So for example if the RSQ units of one pipeline are idle, and you've got to do two RSQs in a row in the other, you'd have no performance penalty - or you could have slightly different units in both paths and use the first path's units for all RSQ work.

I do not think this is a good solution, because there would be too much communication between the units.

Uttar said:
Eventually, since it seems the ADD and MUL units are not really united and have a cache between them, you could even do 2 ADDs and 2 MULs in parallel.

No, this is not a cache. It is only a mux (crossbar) that selects the inputs for the next stage from the output of the previous stage.

Uttar said:
Or you could do that in a completely different way, going away from the CineFX architecture, and that might in fact be much more sensible, because this approach seems fairly icky.

Sure, they could use an approach that goes in a kind of hyperthreading direction, but an instruction sequencer is a large unit.

My personal guess for NV40 is:

- Reduce the quad size to 2 pixels (this is simple)
- 3-4 pixel processors
 
mboeller said:
No;

fundamentally it's a 1-pipeline design, able to output 4 pixels/cycle due to the SIMD architecture.

I've not read the article yet, but what is this being based on?

I ask because I had a meeting the other day with NVIDIA and their new head of Dev Rel for Europe, John Spitzer (he's not new to NVIDIA, as he's been in the US for years, but he's now moving to Russia to head up dev rel for Europe from there, AFAIK). When asking John about the architecture I drew this:

Code:
------------ ------------
|          | |          |
|          |-|   Tex    |
|  FP32/   | |          |
|  Tex     | ------------
|  addr    | ------------
|          | |          |
|          |-|   Tex    |     \  / |
|          | |          |      \/  | |
------------ ------------      /\  |-|-
      |                       /  \   |
-------------------------
|                       |
|         FX12          |
|                       |
-------------------------
            |
-------------------------
|                       |
|         FX12          |
|                       |
-------------------------
            |                  |        |        |
---------------------------------------------------------
|                                                       |
|                Reg Combiner                           |
|                                                       |
---------------------------------------------------------

And he said that was correct for NV30, other than there was no register combiner - this, according to John, has been removed from the NV3x architecture.
 
DaveBaumann said:
mboeller said:
No;

fundamentally it's a 1-pipeline design, able to output 4 pixels/cycle due to the SIMD architecture.

I've not read the article yet, but what is this being based on?

On a nVidia patent describing a "programmable architecture", dated December 2002, where each unit works on 4 pixels at once.

Hmm, Dave, you mention the combiner going away - that patent, as said in the article, doesn't talk much about it, because it just says "whatever system for the combiner works" pretty much. So heck, the patent is vague enough for 2 FX12 units making as much sense.

I'd say taking all that with a bit of salt is preferable - a patent often can be intentionally vague so it covers more territory, and so on. So while the overall idea makes sense and is nearly certainly right, it's hard to make sure of the details since it seems obvious the drivers are not exposing everything...


Uttar

EDIT: Retrieved a few typos.
 
On a nVidia patent describing a "programmable architecture", dated December 2002, where each unit works on 4 pixels at once.

Is that the file date or issue date?

So while the overall idea makes sense and is nearly certainly right, it's hard to make sure of the details since it seems obvious the drivers are not exposing everything...

:?: It's not a question of the drivers not exposing everything for the shaders, it's a question of having a compiler/optimiser tuned to the architecture, something which they haven't got right yet (and may not wholly before NV40, which will be different).

BTW (not sure how far I can go here, since some of it might be under NDA), when asking about some of the fundamental differences between the NV30 and R300 architectures, the description we came up with was "thin and deep" for NV30 and "shallow and wide" for R300, which is accurate enough - NV40 will be more along the lines of the "shallow and wide" approach.
 
DaveBaumann said:
On a nVidia patent describing a "programmable architecture", dated December 2002, where each unit works on 4 pixels at once.

Is that the file date or issue date?

Publication date: http://l2.espacenet.com/espacenet/viewer?PN=WO02103638&CY=gb&LG=en&DB=EPD

So while the overall idea makes sense and is nearly certainly right, it's hard to make sure of the details since it seems obvious the drivers are not exposing everything...

:?: It's not a question of the drivers not exposing everything for the shaders, it's a question of having a compiler/optimiser tuned to the architecture, something which they haven't got right yet (and may not wholly before NV40, which will be different).

Well, that's very true, but what I said wasn't false either.
What I meant is that certain things are not exposed with the current driver. For example, the scalar system for multiplication ( see article ). It isn't put in the compiler, the compiler ignores that capability of the hardware. And obviously, a new compiler is required for that - and before we have that, it's hard to check some of the details like whether LIT can be done in parallel with MAD or not.

BTW (not sure how far I can go here, since some of it might be under NDA), when asking about some of the fundamental differences between the NV30 and R300 architectures, the description we came up with was "thin and deep" for NV30 and "shallow and wide" for R300, which is accurate enough - NV40 will be more along the lines of the "shallow and wide" approach.

Hmm... Dave's good ole hints ;)
So that'd mean the "more traditional than NV30" ( at least in appearance ) rumors are probably right.
Hmm, that could then be either 8 pixels, or 2x4 pixels. Or something different...
Hmm, something else that'd be plausible is that the pipelines would be significantly less complex than on the NV3x, with only general purpose units in them, and that special-purpose units are shared between VS and PS ( like the texture lookup units were supposed to be, but nVidia screwed up there obviously )

So ultimately... The NV30 would just be an experiment for the NV50, an experiment which turned bad? Or the influence on the NV40 might not be obvious. We'll see :)


Uttar
 
DaveBaumann said:
On a nVidia patent describing a "programmable architecture", dated December 2002, where each unit works on 4 pixels at once.

Is that the file date or issue date?

It is the Publication Date: 27 December 2002

It was filed on 19 June 2002.
 
Personally I'd expect patents that pertained directly to NV30 to have been filed in the 1999-2001 period, but I guess they could be late (why they would be I don't know).
 
DaveBaumann said:
BTW (not sure how far I can go here, since some of it might be under NDA), when asking about some of the fundamental differences between the NV30 and R300 architectures, the description we came up with was "thin and deep" for NV30 and "shallow and wide" for R300, which is accurate enough - NV40 will be more along the lines of the "shallow and wide" approach.

This gives a whole new perspective on the "penis size" wars between IHVs... :oops:
 
DaveBaumann said:
Personally I'd expect patents that pertained directly to NV30 to have been filed in the 1999-2001 period, but I guess they could be late (why they would be I don't know).

There is a reference (Priority Data) to the US patent office with the date 19 June 2001.
 
Yes, the NV30 pipeline is very deep:

Code:
 |->-Shadercore (FP32)-<-|
 |    |   |   |------->--|
 |     TMU   Bypass
 |    |   |   |  
 |   Shader Back End <---< Registerfile 
 |       |                    | 
 |   Combiner (FX12)          | 
 |       |                    | 
 --<-Combiner (FX12) ---->---->

I believe NV40 will look more like this:

Code:
 -->--Shadercore (2*FP32) <----> Registerfile
 |     |   |   
 |      TMU   
 |     |   |  
 --<--------
 
Demirug said:
I believe NV40 will look more like this:

Code:
 -->--Shadercore (2*FP32) <----> Registerfile
 |     |   |   
 |      TMU   
 |     |   |  
 --<--------

BTW, you're making a mistake talking of TMUs here I think: The correct term should be a texture lookup unit.
And either they changed things last minute, or the texture lookup unit ain't in the pipeline - it's outside, so what you got are units calling the TMUs. The idea with that was to get texturing in the VS too.

So you'd have more like:
 -->--Shadercore (2*FP32) <----> Registerfile
 |     |   |
      lookup <----->
 |     |   |
 --<--------

Also, talking about the register file, it seems to me the NV3x takes a performance penalty from the beginning - by that, I mean even if you use 4 registers in the first 100 instructions, and then 16 for the last 50, you'll have the penalty of 16 registers for all 150 instructions. That seems like a potential optimization to me in the NV4x, alongside the obvious idea of increasing the size of the register file ( I think it was doubled in the NV35, but since the number of FP32 units was doubled too, that didn't have much impact I guess )
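
A rough sketch of that point, assuming the penalty works by limiting how many pixels fit in the register file at once. `REGISTER_FILE_SLOTS` is a made-up size, not an NV3x figure, and the per-segment scheme is purely hypothetical.

```python
# Sketch only: REGISTER_FILE_SLOTS is an invented size, not an NV3x figure.
REGISTER_FILE_SLOTS = 64

# (instruction count, registers live per pixel) for each part of the shader
segments = [(100, 4), (50, 16)]


def in_flight_peak_allocation(slots, segments):
    """Pixels in flight when every pixel reserves the shader's peak count."""
    peak = max(regs for _, regs in segments)
    return [slots // peak for _ in segments]


def in_flight_per_segment(slots, segments):
    """Hypothetical alternative: reserve only what each segment needs."""
    return [slots // regs for _, regs in segments]


peak_based = in_flight_peak_allocation(REGISTER_FILE_SLOTS, segments)
per_segment = in_flight_per_segment(REGISTER_FILE_SLOTS, segments)
```

With these invented numbers, peak-based allocation leaves room for only 4 pixels during all 150 instructions, while per-segment allocation would allow 16 during the first 100.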


Uttar
 
Uttar said:
BTW, you're making a mistake talking of TMUs here I think: The correct term should be a texture lookup unit.
And either they changed things last minute, or the texture lookup unit ain't in the pipeline - it's outside, so what you got are units calling the TMUs. The idea with that was to get texturing in the VS too.

So you'd have more like:
 -->--Shadercore (2*FP32) <----> Registerfile
 |     |   |
      lookup <----->
 |     |   |
 --<--------

You are right. TMU was the wrong word.


Uttar said:
Also, talking about the register file, it seems to me the NV3x takes a performance penalty from the beginning - by that, I mean even if you use 4 registers in the first 100 instructions, and then 16 for the last 50, you'll have the penalty of 16 registers for all 150 instructions. That seems like a potential optimization to me in the NV4x, alongside the obvious idea of increasing the size of the register file ( I think it was doubled in the NV35, but since the number of FP32 units was doubled too, that didn't have much impact I guess )


Uttar

The additional FP32 unit should not make a difference, because it only changes how the data is calculated and has no impact on how that data is stored.

If nVidia makes the pipeline shorter, they need to store fewer pixels in the register file, because each pixel takes fewer clock cycles to get from the beginning to the end.
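
As a back-of-the-envelope sketch of that relationship, assuming one group of pixels enters the pipeline per clock and holds its registers until it leaves. The clock counts and register counts below are hypothetical, chosen only to show the scaling.

```python
# Sketch under assumptions: one pixel group enters per clock and keeps its
# registers for the whole trip. All figures are hypothetical.

def register_slots_needed(pipeline_clocks, regs_per_pixel, pixels_per_clock=4):
    """Register slots required to keep a pipeline of this depth full."""
    pixels_in_flight = pipeline_clocks * pixels_per_clock
    return pixels_in_flight * regs_per_pixel


deep = register_slots_needed(pipeline_clocks=200, regs_per_pixel=4)
shallow = register_slots_needed(pipeline_clocks=100, regs_per_pixel=4)
```

Under these assumptions, halving the pipeline depth halves the storage needed for the same shader.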
 
FYI, according to John Spitzer the FP units that replaced the FX12 units in NV35 are only capable of arithmetic ops such as MUL, ADD, SUB, DP3, DP4.
 
DaveBaumann said:
BTW (not sure how far I can go here, since some of it might be under NDA), when asking about some of the fundamental differences between the NV30 and R300 architectures, the description we came up with was "thin and deep" for NV30 and "shallow and wide" for R300, which is accurate enough - NV40 will be more along the lines of the "shallow and wide" approach.

Nice little bit of info there, Dave! Although we cannot extrapolate much from this piece, anything that will align the architectures more is good in my book - especially if NV40 will move in the direction of R3x0... ;)

DaveBaumann said:
FYI, according to John Spitzer the FP units that replaced the FX12 units in NV35 are only capable of arithmetic ops such as MUL, ADD, SUB, DP3, DP4.

So the rest is still handled by the 'combined' FP/FP texture unit we know from NV30? (Sorry to ask, I have been neglecting beyond3d this summer! :oops: )
 
DaveBaumann said:
Why? smack :!:

:D

Tries to duck the issue, but gets the smack right in, then bows down while mumbling some hopeless and lame apology about being on holiday or out enjoying the summertime with friends or attending other matters of importance in life - or at the very least playing 3D-games…
 