NV31 closer to NV35 than NV30? - More pipeline mysteries.

boobs

From the Inquirer
NV31 (GeForce 5600)

In blend stages mode (without any pixel shaders) this chip is really capable of performing as 4x1.

In all shader modes, including 1.1/1.4/2.0 shaders, we see something similar to the 2x2 picture - textures are fetched only in pairs. Interestingly, 1.1 and 2.0 shaders are now executed in the same manner, probably on one universal ALU, not in different ways as on NV30. It seems this is the new, NV35-like pixel pipeline technology -- probably the NV35 will also be reconfigurable in this way, from 8x1 to 4x2, depending on the task.

http://www.theinquirer.net/?article=8196
 
There was an interview a little while back where an nVidia spokesperson stated that the NV30 was very flexible, and was selected to run in 4x2 format for "performance reasons."

Apparently, when the NV30 runs in "8-pixel mode" for z/stencil only rendering, it renders with two 2x2 pixel blocks.

It seems to make sense that working with two different 2x2 pixel blocks would result in lower cache coherency, meaning that while running in "8x1 mode" would allow better usage of the chip's available computing resources, it wouldn't allow optimal memory bandwidth usage.

And one other thing. By running in 4x2 mode, the FX may actually be able to capitalize on the optimizations made for nVidia's previous 4x2 architecture chips.

Finally, from the cache coherency standpoint above, there would be little reason to run the NV31 in 2-pixel mode. After all, that would also remove the possibility of using the ddx/ddy operations. Assuming that the NV3x architecture is actually as flexible as nVidia says, then the only reason that the NV31 would benefit from running in 2-pixel mode would be if it didn't have the on-chip cache to handle 4-pixel operation.

And if the NV31 does indeed prove to run in 4-pixel mode (which seems highly likely when you consider how the ddx/ddy fragment operations work), then sometime in the future, nVidia may release optimized drivers for the NV30 that get it to run in 8x1 mode "full time," but I wouldn't hold my breath.
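The ddx/ddy argument above rests on how these instructions are usually implemented: as finite differences between neighboring pixels processed together in a 2x2 quad. A minimal sketch of that idea (my own illustration, not NV3x internals):

```python
# ddx/ddy as per-quad finite differences: each derivative is the
# difference between adjacent pixels in the same 2x2 block, which is
# why quad-based (4-pixel) processing makes these operations cheap.
def quad_derivatives(v):
    """v is a 2x2 block of per-pixel values:
       [[top_left, top_right], [bottom_left, bottom_right]]."""
    ddx = v[0][1] - v[0][0]   # horizontal neighbor difference
    ddy = v[1][0] - v[0][0]   # vertical neighbor difference
    return ddx, ddy

print(quad_derivatives([[1.0, 3.0], [2.0, 4.0]]))  # (2.0, 1.0)
```

A 2-pixel mode would leave each pixel without one of its quad neighbors, making one of the two derivatives unavailable.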
 
And one other thing. By running in 4x2 mode, the FX may actually be able to capitalize on the optimizations made for nVidia's previous 4x2 architecture chips
I'm speechless... :rolleyes:

So much for moving the industry forward with actual "technology" and not Fancy "Useless" PR Spin eh???
And if the NV31 does indeed prove to run in 4-pixel mode (which seems highly likely when you consider how the ddx/ddy fragment operations work), then sometime in the future, nVidia may release optimized drivers for the NV30 that get it to run in 8x1 mode "full time," but I wouldn't hold my breath.
Complete techno gobbledygook... The NV30 is a 4x2 with some processing taking place at 8 pixels. That's it. The entire rest of this flexible techno jargon is complete PR BS spin that holds no water in reality whatsoever.
 
Dave H said:
Sigh. The Inquirer filches yet another "news" story from a B3D forum comment without attribution. (I'm assuming; right now their site appears to be down so I can't check.)

so let's go flame the editor :LOL:

no seriously, is there nothing that can be done against such thieves? o_O
 
Hellbinder[CE] said:
Complete techno gobbledygook... The NV30 is a 4x2 with some processing taking place at 8 pixels. That's it. The entire rest of this flexible techno jargon is complete PR BS spin that holds no water in reality whatsoever.

How do you know for sure? Unless I'm mistaken, you're working from the same info we are and that means the same test results. And all they tell us is that the NV3x appears to operate as a legacy 4x2 setup as currently exposed by the drivers.

Unless of course you have direct access to the design team and they've told you different...
 
The plagiarism is so blatant they just lifted whole phrases, not just numbers. Mind-boggling. OTOH, they later linked to DL, which is where we got our info (albeit on our forum), so there was semi-relevant credit given. "We all know" about NV30's architecture mainly thanks to our huge thread, though.
 
The pipeline configuration of a card is an implementation detail. Over the years, we have gone back and forth from 1-3 texture units per pipeline.

(Voodoo2, 2 units per pipe, TNT 1 per pipe, Voodoo5, 1 per pipe, GF2 2 per pipe, R100 3 per pipe, R300 1 per pipe, GF FX (variable pipes)...)

The optimal pipeline configuration depends on the rendering task at hand. If most of your pixels are single textured, then an x1 config is better; if you are doing Quake-style multitexture or an even number of texture stages, an x2 config is better.

For games that switch back and forth between single textured passes and multitextured shader passes, it depends on which pass requires the highest fillrate. Obviously, it would be nice if the configuration of the pipelines themselves was programmable.

It is not true that 8 pipelines are always better than 4, since it depends on what you are rendering. Remember, both the Voodoo5 and TNT had "variable pipes": in single texturing, you had twice the number of pipes available for writing pixels as in multitexturing mode. Also, depending on whether trilinear was enabled or not, some cards would "lose" a texture unit.

The GFFX's architecture is probably just another variation of the type of limitations the V5, TNT, etc had where you lost resources depending on rendering state.

It would be nice if the NV35 could do something like: write 16 Z+Stencil values OR write 8 dual textured pixels per cycle. That would take care of the Doom3 case quite nicely.
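The trade-off DemoCoder describes can be put into a toy fillrate model. This is my own sketch, ignoring bandwidth limits, loopback overhead, and state-change costs; "PxT" means P pixel pipes with T texture units each:

```python
# Simplified fillrate model for a "PxT" part (P pixel pipes, T texture
# units per pipe).  Extra textures beyond T per pixel are assumed to
# cost one extra loopback cycle per group of T textures.
import math

def pixels_per_clock(pipes, tmus_per_pipe, textures_per_pixel):
    """Pixels written per clock under the assumptions above."""
    if textures_per_pixel == 0:
        return float(pipes)  # pure z/stencil-style fill
    cycles = math.ceil(textures_per_pixel / tmus_per_pipe)
    return pipes / cycles

# Single texturing: 4x1 and 4x2 both write 4 pixels/clock (the 4x2
# idles half its TMUs).  Dual texturing: 4x2 holds 4, 4x1 drops to 2.
print(pixels_per_clock(4, 1, 1), pixels_per_clock(4, 2, 1))  # 4.0 4.0
print(pixels_per_clock(4, 1, 2), pixels_per_clock(4, 2, 2))  # 2.0 4.0
```

This is why neither config is universally better: the winner flips with the number of textures per pixel.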
 
DemoCoder said:
The pipeline configuration of a card is an implementation detail. Over the years, we have gone back and forth from 1-3 texture units per pipeline.

(Voodoo2, 2 units per pipe, TNT 1 per pipe, Voodoo5, 1 per pipe, GF2 2 per pipe, R100 3 per pipe, R300 1 per pipe, GF FX (variable pipes)...)

Just out of curiosity, did any games ever end up benefitting from the R100's 3 TMUs per pipe?
 
Nagorak said:
DemoCoder said:
The pipeline configuration of a card is an implementation detail. Over the years, we have gone back and forth from 1-3 texture units per pipeline.

(Voodoo2, 2 units per pipe, TNT 1 per pipe, Voodoo5, 1 per pipe, GF2 2 per pipe, R100 3 per pipe, R300 1 per pipe, GF FX (variable pipes)...)

Just out of curiosity, did any games ever end up benefitting from the R100's 3 TMUs per pipe?

Dunno, Serious Sam maybe?

Wasn't there an S3 card (Savage?) that had 4 TMUs per pipe?
 
DemoCoder said:
The pipeline configuration of a card is an implementation detail.

When all that goes into a pixel is textures, you are right, there is not much difference between a 4x2 and an 8x1. But as DemoCoder has pointed out in these forums many times, what will be important for future applications is shader execution speed (rather than bandwidth or texture fetch bottlenecks). And having 4 shader execution units rather than 8 does make a big difference in shader execution speed.
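The "4 vs 8 shader units" point can be made concrete with back-of-envelope arithmetic. A purely illustrative model (my own assumptions, not NV3x specifics): each unit retires one instruction per clock for one pixel at a time.

```python
import math

# Cycles to shade N pixels with an I-instruction arithmetic-bound
# shader on U parallel shader units (one instruction/unit/clock).
def shader_cycles(num_pixels, instructions, units):
    return math.ceil(num_pixels / units) * instructions

# A 20-instruction shader over 1024 pixels:
print(shader_cycles(1024, 20, 4))  # 5120 cycles with 4 units
print(shader_cycles(1024, 20, 8))  # 2560 cycles with 8 units
```

For arithmetic-bound shaders, halving the unit count simply doubles the time, regardless of texture throughput.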
 
But we currently don't know the exact processing power of the NV30 when it comes to shaders. The drivers suck too much to expose however much power it has (either that, or the hardware is seriously borked...which would be a first from nVidia...).
 
DemoCoder said:
Nagorak said:
DemoCoder said:
The pipeline configuration of a card is an implementation detail. Over the years, we have gone back and forth from 1-3 texture units per pipeline.

(Voodoo2, 2 units per pipe, TNT 1 per pipe, Voodoo5, 1 per pipe, GF2 2 per pipe, R100 3 per pipe, R300 1 per pipe, GF FX (variable pipes)...)

Just out of curiosity, did any games ever end up benefitting from the R100's 3 TMUs per pipe?

Dunno, Serious Sam maybe?

Wasn't there an S3 card (Savage?) that had 4 TMUs per pipe?

Well, the Savage 2000 was able to output one quad-textured pixel per cycle or two dual-textured pixels per cycle. In fact, its pipeline configuration was flexible, if I remember correctly. While we're talking about flexible pipeline configurations, I've heard that Flipper was pretty good in this area; I don't know for sure how it works because Nintendo's docs aren't publicly released, but the TEV is configurable in a lot of ways. NV10 was also able to adapt its configuration: 4 single-textured pixels or 2 dual-textured pixels. NV15 lost this ability by splitting each TMU; Nvidia probably thought dual-textured performance was predominant at that time.

Oh, and the Matrox Parhelia also has 4 TMUs per pipe, but that doesn't seem very useful today. I was wondering why nobody was applying the same technique used by Flipper (especially since the ex-ArtX engineers are now part of ATI), because I see these "fixed" pipelines as a waste of resources.
 
Actually, there is one other possibility.

According to some rumors (although those aren't the most reliable), in each of NV30's pipes, an FP ALU operation couldn't be done at the same time as a TMU operation.

So the idea would be that the NV30 really is 4x2 by driver limitation,
but that it couldn't be 8x1 --
only 4x2, 8x0 or 8x0.5.

What is 4x2? Well, 4x2 is 4 pipelines + 8 TMUs, for a total of 12 operation units.
8x1 is 8 pipelines + 8 TMUs, for a total of 16 operation units.

8x0 is 8 pipelines + 0 TMUs, for a total of 8 operation units.
8x0.5 is 8 pipelines + 4 TMUs, for a total of 12 operation units.
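Uttar's "operation unit" arithmetic can be rechecked in a couple of lines (the pipes + TMUs metric is his own informal notation, not an official measure):

```python
# Recomputing the operation-unit counts above: units = pipes + TMUs.
def op_units(pipes, tmus):
    return pipes + tmus

configs = {"4x2": (4, 8), "8x1": (8, 8), "8x0": (8, 0), "8x0.5": (8, 4)}
for name, (p, t) in configs.items():
    print(f"{name}: {p} pipes + {t} TMUs = {op_units(p, t)} units")
```

Note that 4x2 and 8x0.5 come out to the same 12 units, which is the crux of the argument that follows.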

See where I'm going with this?
Saying nVidia is forcing the drivers to run it as a 4x2 might be accurate. But at the same time, it might be incorrect to say nVidia could make it an 8x1.

The question is whether the NV3x architecture is sufficiently flexible to have 1 TMU for 2 pipes. Maybe that's broken, too. Maybe it'll be made possible in future drivers. Or maybe in the NV35 (that would make a lot of sense, IMO - it seems like an easy optimization).

Another question is whether switching between 4x2, 8x0 and 8x0.5 would cause an important stall.
It would seem likely that rearranging everything like that causes a fairly important stall. But how big?
Obviously, you couldn't switch it for every polygon to have the most optimal arrangement each time. That would KILL performance.
It's more on a per-pass basis. Eventually, with hardcoded settings, per-game.


As always, just speculation.


Uttar
 
DaveBaumann said:
It can never write more than 4 pixels per clock - NVIDIA have stated this.

Yeah, I know. I was mostly talking about pipeline arrangement, though.
I doubt an 8x0.5 (which could be two times faster in PS FP arithmetic-limited programs) would write more than 4 pixels per clock on average, anyway.
They could have optimized the caches for 4 pixels per clock, and made it impossible to do 8 pixels per clock due to the whole memory controller.

Sure, it's unlikely. But it isn't impossible.


Uttar
 
To get back to the original topic... it's interesting to note that NV31 and NV34 are indeed plain old 4x1 architectures. No secret 2x2, no "Hyper-Zixel".

The first point is not too surprising, considering the 128-bit DDR memory bus. The second does seem slightly surprising to me: DDR-z should be a valuable optimization for many future engines, and, since we know it is done using the extra z-units already in place for MSAA, it would not appear to carry a large hardware cost (although there will be some).
 
DemoCoder said:
Nagorak said:
DemoCoder said:
The pipeline configuration of a card is an implementation detail. Over the years, we have gone back and forth from 1-3 texture units per pipeline.

(Voodoo2, 2 units per pipe, TNT 1 per pipe, Voodoo5, 1 per pipe, GF2 2 per pipe, R100 3 per pipe, R300 1 per pipe, GF FX (variable pipes)...)

Just out of curiosity, did any games ever end up benefitting from the R100's 3 TMUs per pipe?

Dunno, Serious Sam maybe?

Wasn't there an S3 card (Savage?) that had 4 TMUs per pipe?

I don't know much about the Savage, but many games use more than 2 texture passes. Granted, they don't tend to use them heavily, but all sorts of special effects are done with extra texture passes.
 