NVIDIA CineFX Architecture (NV30)

I completely agree with DemoCoder. When I heard of Cg for the first time, I thought "3D viewport acceleration." After a first glimpse at NV3x's architecture as presented in that (marketing) paper, my instant reaction was "off-line scanline render accelerator WEEEE!"

Yes, I think NV is diversifying into new markets. Or does anybody here seriously believe that these instruction counts would be of any use on a gamer's VGA?

ta,
-Sascha.rb

P.S. thanks for your "moderation", JR.rb
 
DemoCoder said:
I think you guys are way off base when talking about the NV30's "useless" extra features. It is clear that NVidia did not intend these for games (DX9) but is positioning itself to take the RenderFarm/Workstation market by storm.

Evidence:

1) Constant marketing talk of doing offline rendering in real time
2) The Cg language
3) Acquisition of ExLuna (best RenderMan renderer, even better than PRMan in some regards)
4) Siggraph papers

... and let me introduce another piece of evidence before the Grand Jury:

5) Expert statement from John Carmack about "Realtime and offline rendering converging" on Slashdot, June 27.

Quote #1:

The current generation of cards do not have the necessary flexibility, but cards released before the end of the year will be able to do floating point calculations, which is the last gating factor. Peercy's (IMHO seminal) paper showed that given dependent texture reads and floating point pixels, you can implement renderman shaders on real time rendering hardware by decomposing it into lots of passes. It may take hundreds of rendering passes in some cases, meaning that it won't be real time, but it can be done, and will be vastly faster than doing it all in software.

Quote #2:

I had originally estimated that it would take a few years for the tools to mature to the point that they would actually be used in production work, but some companies have done some very smart things, and I expect that production frames will be rendered on PC graphics cards before the end of next year. It will be for TV first, but it will show up in film eventually.

I rest my case... ;)
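
For anyone wondering what that Peercy-style decomposition actually looks like: the idea is just "run one long shader as a chain of short passes, with floating-point buffers carrying the intermediate results between them." Here's a rough CPU-side sketch of the concept (the pass names and math are made up for illustration, this is not actual RenderMan or driver code):

/* Rough sketch of Peercy-style decomposition: a shader too long for the
 * hardware limit is split into short "passes", each pass reading the
 * previous pass's output from a floating-point buffer (ping-pong style).
 * The pass names and math are invented purely for illustration. */
#include <stdio.h>

#define PIXELS 4                 /* tiny "framebuffer" for the example */

typedef void (*pass_fn)(const float *in, float *out, int n);

/* pass 1: pretend this is the first chunk of a long shader */
static void pass_diffuse(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * 0.8f + 0.1f;
}

/* pass 2: the next chunk, reading pass 1's result the way a dependent
 * texture read would */
static void pass_specular(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * in[i];
}

int main(void)
{
    float bufA[PIXELS] = { 0.2f, 0.4f, 0.6f, 0.8f };  /* input values  */
    float bufB[PIXELS];                               /* ping-pong buf */
    pass_fn passes[] = { pass_diffuse, pass_specular };

    const float *src = bufA;
    float *dst = bufB;

    /* the "hundreds of passes" case is just a longer version of this loop */
    for (int p = 0; p < 2; ++p) {
        passes[p](src, dst, PIXELS);
        const float *tmp = src;
        src = dst;
        dst = (float *)tmp;
    }

    for (int i = 0; i < PIXELS; ++i)
        printf("pixel %d = %f\n", i, src[i]);
    return 0;
}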
 
I agree with you Democoder.

However, I doubt those features will amount to a hill of beans as selling points in the mainstream market.

That said, if NVIDIA came up with a distributed client like SETI@Home or RC5 that allowed individuals to be a "part of" the next Pixar movie, it might be a strange-ish selling point for some.
 
From the chip diagram of the R300 it looks like it's ALWAYS doing 128-bit internal calculation. Looks like it's "free" on the R300, but incurs a penalty on the NV30. Also, they say 64k vertex shader instructions. They're probably including loops and such, in which case the R300 can do just as many.
 
Democoder,

Along the lines of what Russ was saying, when I was personally talking about the NV30's (apparent) increase in flexibility over the R-300 being "useless", I mentioned specifically that I was speaking in terms of value to the gamer, not the "3D productivity" professional.

I do certainly agree that both the NV-30 and R-300 architectures are astoundingly interesting from the Renderfarm / Workstation point of view...even more so than from the gaming aspects, IMO.
 
fresh said:
From the chip diagram of the R300 it looks like it's ALWAYS doing 128-bit internal calculation. Looks like it's "free" on the R300, but incurs a penalty on the NV30. Also, they say 64k vertex shader instructions. They're probably including loops and such, in which case the R300 can do just as many.

1) What makes you think there's a "penalty"? If you get a speed-up when going to less-than-128-bit calculations, is doing 128-bit a penalty or not? What if, for example, the NV30 is as fast as the R300 at 128-bit, but even faster at lower precision? Of course, we have no idea about the relative performance of the R300 and NV30, so my statement is simply playing devil's advocate to your equally baseless assertion.

2) I think the extended-instruction-count discussion between the two chips concerns pixel shaders, not vertex shaders.
 
Fresh - agreed. If you assume that NVIDIA are counting loop iterations (each of the 64 possible loops re-running the program), then with 1024 possible instructions and 64 maximum loops you get 1024 x 64 = 65536.
 
Oh how I wish I could talk more about this. But how about this interesting quote from Carmack:

"My current work on Doom is designed around what was made possible on the original GeForce, and reaches an optimal implementation on the NV30. My next generation of work will be designed around what is made possible on the NV30."
 
I agree with you too Democoder,

But my criticism about 1000 passes is about real-time gaming for the mainstream gamer, now or in the immediate 2 or 3 years. I believe we all have a consensus on that.

Kudos to Nvidia for introducing a good graphics ISA for the professional market. Maybe these people will use some multichip products.

I am curious about this 16-bit and 32-bit FP that Dave talked about. What will the impact be? :-?
 
ben6 said:
"My current work on Doom is designed around what was made possible on the original GeForce, and reaches an optimal implementation on the NV30. My next generation of work will be designed around what is made possible on the NV30."
Do you have a link?
 
pascal said:
I am curious about this 16-bit and 32-bit FP that Dave talked about. What will the impact be? :-?

I suspect they can't have a floating-point unit that is easily split into two smaller floating-point units to give twice as many half-precision units, so it will most likely be a method for saving bandwidth.

If the developer knows that 16-bit FP is "enough", then they can use that; if they need 32, they can use that instead.
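
Just to put rough (purely illustrative) numbers on the bandwidth angle, here's a full-screen 4-channel floating-point buffer at half vs. full precision:

/* Back-of-envelope illustration of why half precision mostly saves
 * bandwidth/storage: a 4-channel FP buffer at 1600x1200.  Resolution
 * and layout are just example numbers. */
#include <stdio.h>

int main(void)
{
    const int width = 1600, height = 1200, channels = 4;
    const int fp16_bytes = 2, fp32_bytes = 4;          /* bytes per channel */

    double mb16 = (double)width * height * channels * fp16_bytes / (1024.0 * 1024.0);
    double mb32 = (double)width * height * channels * fp32_bytes / (1024.0 * 1024.0);

    printf("128-bit (4 x fp32) buffer: %.1f MB\n", mb32);   /* ~29.3 MB */
    printf(" 64-bit (4 x fp16) buffer: %.1f MB\n", mb16);   /* ~14.6 MB */
    return 0;
}

Every read or write of that buffer moves half the data in the 16-bit case, which is where the saving would come from.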
 
Wow, heh, looks like I got in late on this subject :p

Anyway, after reviewing the paper, here are my comments:

1. The 65536 number is meaningless. It's 1024 static instructions with 64 possible loops. This appears to be identical to the R300 at this point in time. Since I don't yet know the full specs of either vertex shader, I am not aware of any differences here.

2. The 1024 number of instructions in the pixel shader is very exciting, and the possibility of "simulating" branching by executing all branches is excellent (there's a quick sketch of the idea below). This is, in fact, probably the only way it would work well in real-time hardware (as a side note, I believe I remember reading that Intel's Itanium hardware does this very thing...). This is far above and beyond what ATI offers in the R300, from what I can tell. Whether or not the added instructions will be used in games, however, is a different story. Hopefully we'll see some testimonials from game developers on this subject.

3. It is disappointing that there are still limitations in both pipelines. It seems to me that the programs should have been virtualized, with recommendations to game developers on the size of their programs, along with programming hints to keep cache hits high.
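
Since "executing all branches" keeps coming up, here's a minimal sketch of the select/predication trick: both branches are always evaluated and a 0/1 mask picks the result (plain C, invented shading math, not actual NV30 code):

/* "Execute both branches, keep only the relevant result": branching
 * faked without real flow control.  The shading math is made up. */
#include <stdio.h>

static float shade_with_branch(float n_dot_l)
{
    if (n_dot_l > 0.0f)
        return n_dot_l * 0.9f;   /* lit path     */
    else
        return 0.05f;            /* ambient path */
}

static float shade_without_branch(float n_dot_l)
{
    float lit     = n_dot_l * 0.9f;                   /* branch A, always run */
    float ambient = 0.05f;                            /* branch B, always run */
    float mask    = (n_dot_l > 0.0f) ? 1.0f : 0.0f;   /* condition as 0 or 1  */
    return mask * lit + (1.0f - mask) * ambient;      /* select the result    */
}

int main(void)
{
    for (float x = -0.5f; x <= 0.5f; x += 0.5f)
        printf("x=%5.2f  branch=%f  select=%f\n",
               x, shade_with_branch(x), shade_without_branch(x));
    return 0;
}

The obvious cost is that you always pay for both sides, which is why the instruction count balloons so quickly.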
 
This is all great... But the reality is that by the time a serious DX9 game ships, both these cards will be too slow to deliver what's considered "leading edge performance"...

This DX9 battle over who is better is best left for the next generation...

On a side note, 1024 instructions is flat-out overkill... Plain and simple. Besides, we'll have to wait for the actual comparative benchmarks to see which architecture delivers the performance in the DX9 arena.
 
Would there be any money in making multi-chip boards for Nvidia, for use in mini-renderfarms?

If so, I think it would help drive progress toward better graphics in games.
I just love the thought of seeing Toy Story graphics in games, and I think it's actually very close.

Hopefully Cg and RenderMonkey will put lots of DX9 effects into games.
 
Chalnoth said:
The 1024 number of instructions in the pixel shader is very exciting, and the possibility of "simulating" branching by executing all branches is excellent.

Is 1024 instructions enough to "simulate" branching in this manner (execute all branches and only use the results of the relevant ones)? Surely even the 160 instructions of the R-300 are likely pushing what can be accomplished at real-time performance, so 1024 is obviously more than enough for that.

When real-time is not a requirement (for use in productivity / renderfarm type situations), being able to handle more instructions is of course more important... the assumption being that it's still much faster on pixel shading hardware than in pure software. But because this specific architecture must (apparently) execute all branches, how often would the 1024 limit be reached for non-real-time applications? How complex do these non-real-time applications get with their RenderMan-type shaders?
 
Hi Chalnoth,
Chalnoth said:
1. The 65536 number is meaningless. It's 1024 static instructions with 64 possible loops. This appears to be identical to the R300 at this point in time. Since I don't yet know the full specs of either vertex shader, I am not aware of any differences here.
IIRC, the R300 reaches its 1024 instructions with a shader depth of 256 operations / 1024 operations per pass. The NV30 appears to have a shader depth of 1024 operations / 65k per pass.
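(If I'm reading those figures right, the implied arithmetic would be 256 static instructions x 4 loop iterations = 1024 executed operations per pass for the R300, versus 1024 static instructions x 64 loop iterations = 65536 for the NV30.)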

Correct me if I'm wrong. I didn't learn the whitepaper by heart. ;)

ta,
-Sascha.rb
 
pascal, it's a quote that I have a feeling you'll see in one of the Siggraph presentations very shortly. Nvidia has given the ok to use the quote.
 
Hellbinder[CE] said:
On a side note, 1024 instructions is flat-out overkill... Plain and simple. Besides, we'll have to wait for the actual comparative benchmarks to see which architecture delivers the performance in the DX9 arena.

There are three main reasons that I can see why we need lots of instructions (1024 is certainly NOT overkill):

1. Backwards-compatibility. Developers shouldn't have to worry about hard limitations set by hardware.

2. High-end rendering. I guarantee you that 1024 instructions is not enough to do all shaders seen in movies and other high-end 3D rendering. These cards may be good enough to market in these situations, but their performance will be a fair bit below what we might see with fully-virtualized programs.

3. Freedom. What if you want to have one incredibly-long shader at some place in your game, but compensate by having few other large shaders? There are certainly reasons for incredibly-long shaders, and while it would certainly kill performance to have them all over the place, one or two that aren't used on much of the screen wouldn't be bad.
 
No point in doing multichip boards. Current software renderers are already designed to split up the scene and ship it to multiple computers for rendering. 10 cheap Linux boxes with their 10 separate CPUs, AGP buses, and NV30s will be a lot cheaper, simpler to engineer/maintain, and better performing than a single Linux box with 10 NV30s on a single board.

You want a separate CPU/AGP bus for each NV30 chip; otherwise you are overloading the bandwidth. Moreover, no multichip board could have enough RAM to load the geometry and textures necessary to render today's full-frame scenes for RenderMan. You have to split it up, even if you have 256MB of video RAM.

Then, let's not even talk about power density and heat dissipation.
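
To put the "split it up" argument in concrete terms, the usual farm approach is just to hand each whole machine its own slice of the work, e.g. a contiguous range of frames. A toy sketch (host count and frame numbers invented):

/* Toy sketch of render-farm work splitting: each of N boxes gets its
 * own contiguous range of frames, with its own CPU, AGP bus and card.
 * All numbers are made up for illustration. */
#include <stdio.h>

int main(void)
{
    const int total_frames = 240;      /* e.g. 10 seconds at 24 fps */
    const int num_boxes    = 10;       /* ten cheap Linux boxes     */
    const int per_box      = (total_frames + num_boxes - 1) / num_boxes;

    for (int box = 0; box < num_boxes; ++box) {
        int first = box * per_box;
        int last  = first + per_box - 1;
        if (last >= total_frames)
            last = total_frames - 1;
        printf("box%02d renders frames %3d..%3d\n", box, first, last);
    }
    return 0;
}

The same idea extends to splitting a single frame's scene into buckets across boxes, which is exactly why a big shared multichip board doesn't buy you much.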
 
nggalai said:
Hi Chalnoth,
Chalnoth said:
1. The 65536 number is meaningless. It's 1024 static instructions with 64 possible loops. This appears to be identical to the R300 at this point in time. Since I don't yet know the full specs of either vertex shader, I am not aware of any differences here.
IIRC, the R300 reaches its 1024 instructions with a shader depth of 256 operations / 1024 operations per pass. The NV30 appears to have a shader depth of 1024 operations / 65k per pass.

Correct me if I'm wrong. I didn't learn the whitepaper by heart. ;)

ta,
-Sascha.rb

Hrm, I'll have to look into that. If this is true, then the VS of the NV30 is tremendously more capable.

Btw, what is the point of having only four possible branches? If that's true about the R300's VS, then it is almost pointless to have flow control at all...
 