nV40 info w/ benchmarks

According to Warp2Search

http://www.warp2search.net/modules.php?name=News&file=article&sid=16154

Just got word that the guys at 3DCenter have snagged some info on Nvidia's upcoming range of graphics cards: NV40, NV41, NV45! As the site's in German, I've added a Babel Fish translation.

nVidia NV40

175 million transistors, manufactured at 130nm
8x2 architecture, but 16 Z/stencil tests per clock
DirectX 9.0 architecture, supports Shader 3.0
compared with NV38, doubled and more efficient pixel shaders
supports DDR1, GDDR2 and GDDR3
internal AGPx8 interface
exact clock rates: unknown; estimated at 500 to 600 MHz chip clock and 600 to 800 MHz memory clock
Improvements to anti-aliasing: (at least) one new mode; its sample pattern is unknown, however
Improvements to the anisotropic filter: unknown
Presentation: GDC or CeBIT at the end of March 2004
Market entrance: end of April or May 2004
Sales name: GeForceFX 6XXX

http://babelfish.altavista.com/babe...tp://www.3dcenter.de/artikel/2004/01-27_a.php
 
Unknown Soldier said:
Just got word that the guys at 3DCenter have snagged some info on Nvidia's upcoming range of graphics cards: NV40, NV41, NV45! As the site's in German, I've added a Babel Fish translation.
Looks like more speculation than anything to me. I doubt they have any special information.
 
Unknown Soldier said:
Maybe some of our German friends would like to translate it properly?

US
A translation is in the works; should be done in a couple of days. But, really, it's just one of our usual collected speculation articles (we've been doing such pieces since the year 2000). It's not as if we strapped down Vivoli and tortured the final specs out of his prone body or something. ;)

93,
-Sascha.rb
 
Why is it that, even though the title includes the NV41 and NV45, only NV40 information is given, and it's all recycled info? :p ;)

BTW, it's not 600-800MHz for the GDDR. It was confirmed at 750-800MHz (1500-1600MHz effective) a long time ago; the only question left is whether NVIDIA will underclock it to 750MHz or not, really.
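For anyone who wants to check the arithmetic behind the "effective" figures, here is a minimal Python sketch. The doubling comes from DDR-type memory transferring data twice per clock. The 256-bit bus width is purely an assumption for illustration (the thread does not state one), and the function names are made up for this example.

```python
def effective_mhz(clock_mhz: float) -> float:
    """DDR-type memory transfers data twice per clock cycle."""
    return clock_mhz * 2


def bandwidth_gb_per_s(clock_mhz: float, bus_width_bits: int) -> float:
    """Peak theoretical bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return effective_mhz(clock_mhz) * 1e6 * bytes_per_transfer / 1e9


# Assumption: a 256-bit memory bus. This figure is NOT given in the thread.
ASSUMED_BUS_WIDTH_BITS = 256

for clock in (750, 800):
    print(
        f"{clock} MHz -> {effective_mhz(clock)} MHz effective, "
        f"{bandwidth_gb_per_s(clock, ASSUMED_BUS_WIDTH_BITS):.1f} GB/s peak"
    )
```

Under that assumed bus width, 800MHz works out to 51.2 GB/s peak and 750MHz to 48.0 GB/s, so the choice between the two clocks is a difference of roughly 3 GB/s on paper.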

Regarding register usage, I'm seriously thinking the trick, this time around, is a significantly better usage of the available registers, not only a higher number of them. I'm thinking of two things here:

1) Much shorter bypass path in case there's no texturing going on. I would tend to believe (much) smarter logic is required here no matter what, because: a) With the VS (ab)using the PS' texture lookup units, they might not always be available. And b) With dynamic branching, this type of stuff would allow more efficient usage of the lookup units (although I'm extremely doubtful of the NV4x doing that sort of stuff).

2) Ability to increase and decrease the number of registers dedicated to a quad "on the fly" (although once again, I'm doubtful of that). Basically, at the beginning of the program, all quads begin with their maximum register count; but when registers are going to be idle for sure during all the following passes, they are liberated. Not starting with the maximum number of registers would be better, but I don't think it's cost-effective, really.
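To make idea 2) concrete, here is a toy Python model of a shared register file in which each quad starts with its maximum register count and hands registers back once they are known to be idle. This is only a sketch of the speculation above, not a description of how NV4x (or any real chip) works; the class, its methods and all the sizes are hypothetical.

```python
class RegisterFile:
    """Toy model of a register file shared between in-flight quads."""

    def __init__(self, total_registers: int):
        self.total = total_registers
        self.allocated = {}  # quad id -> registers currently held

    def free(self) -> int:
        """Registers not currently held by any quad."""
        return self.total - sum(self.allocated.values())

    def start_quad(self, quad_id: int, max_registers: int) -> bool:
        """A quad begins with its maximum register count, if that many are free."""
        if max_registers > self.free():
            return False
        self.allocated[quad_id] = max_registers
        return True

    def release(self, quad_id: int, count: int) -> None:
        """Liberate registers that will be idle for all following passes."""
        self.allocated[quad_id] -= min(count, self.allocated[quad_id])


rf = RegisterFile(total_registers=16)  # hypothetical size
print(rf.start_quad(0, 8))  # first quad fits
print(rf.start_quad(1, 8))  # second quad fits
print(rf.start_quad(2, 8))  # no room: both quads hold their maximum
rf.release(0, 4)            # quad 0 no longer needs 4 of its registers
rf.release(1, 4)            # same for quad 1
print(rf.start_quad(2, 8))  # the liberated registers admit a third quad
```

The point of the sketch is the last line: without the release step the file stays full with two quads in flight, while with it a third quad can start, which is the "better usage of the available registers" being speculated about.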

Hmm, that gives me an idea on how to check on that speculation. I'll see if I can get any confirmations, although I doubt I will.


Uttar
 
I've heard the late April/May launch timeframe mentioned as well. By waiting they can..

-ramp up production with the A2 stepping (which should be capable of 500MHz+).
-accommodate an NV40 + D3 bundle; otherwise impractical.

MuFu.
 
Uttar said:
With the VS (ab)using the PS' texture lookup units, they might not always be available.

Hmmm, surely VS must have its own texture lookup functionality, since after all, most pixel rendering does involve texturing. I would think it would be rather ungood to have VS and PS both bottlenecking each other.

Surely this lookup hardware can't be prohibitively expensive in terms of transistors?
 
Guden Oden said:
Uttar said:
With the VS (ab)using the PS' texture lookup units, they might not always be available.

Hmmm, surely VS must have its own texture lookup functionality, since after all, most pixel rendering does involve texturing. I would think it would be rather ungood to have VS and PS both bottlenecking each other.

Surely this lookup hardware can't be prohibitively expensive in terms of transistors?
The main problem is that a texture lookup usually has a very large latency (easily 50-100 clocks if you get a texture cache miss, which is frequent enough that you have to assume that it happens as the default case). In a regular pixel shader, you mask this latency by interleaving execution between 100+ pixel groups. This interleaving requires large on-chip buffers, which costs large numbers of transistors. If you don't share the texture mappers between the vertex and pixel shaders under VS/PS3.0, you end up either paying this cost twice or get a vertex shader that is very, very slow at texturing.
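A rough back-of-the-envelope version of that argument, as a Python sketch. Only the 50-100 clock latency comes from the post above; the one clock of work per group between lookups, the group size, the register count and the register width are all assumptions picked for illustration, and the function names are made up.

```python
import math


def groups_needed(latency_clocks: int, work_clocks_per_group: int) -> int:
    """Pixel groups to interleave so the units never sit idle.

    While one group waits on its texture lookup, the other groups must
    supply at least `latency_clocks` worth of work between them.
    """
    return math.ceil(latency_clocks / work_clocks_per_group) + 1


def buffer_bytes(groups: int, pixels_per_group: int,
                 registers_per_pixel: int, bytes_per_register: int) -> int:
    """On-chip storage needed to keep every interleaved group's state."""
    return groups * pixels_per_group * registers_per_pixel * bytes_per_register


# Assumptions (not from the thread): 1 clock of work per group per turn,
# 4 pixels per group, 4 live registers per pixel, 16 bytes per register.
for latency in (50, 100):
    n = groups_needed(latency, work_clocks_per_group=1)
    size = buffer_bytes(n, pixels_per_group=4,
                        registers_per_pixel=4, bytes_per_register=16)
    print(f"latency {latency}: {n} groups, {size} bytes of buffer")
```

With these made-up numbers, a 100-clock miss needs 101 groups in flight, which lines up with the "100+ pixel groups" figure, and a separate, equally fast texturing path for the vertex shader would need that buffer a second time.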
 
Arjan,

So why not have a separate lookup thingy for VS, but use the same on-chip caches/buffers as the PS lookup thingies?

Edit: Oh, and by the way, thanks for the great explanation, man! It's appreciated
 
MuFu said:
Remember that it has 175 million transistors and nV have seen fit to give it 50GB/sec+ memory bandwidth. Current rumours are that it can work on 4 quads in certain situations. That would seem to suggest that it's an 8x2/16x0 design (in the same way that NV35 can be thought of as 4x2/8x0) and may have approximately twice the pixel throughput of NV35, per clock. I've heard that PS performance is already well above current parts, even on the A0 samples. VS shows less of an improvement; maybe they have just incorporated a single, extra VS unit.

They might ramp with A1; it's apparently not as clock-limited as they thought it might have been.

MuFu.

I don't understand how you can think of nv30/5/8 as anything but 4x2. The information nVidia has stated plainly is that only 4 color pixels per clock can be rendered to screen--under no circumstances may 8 color pixels per clock be rendered to screen, regardless of whether or not a texel is attached to a pixel. "8x0" means to me "8 color pixels rendered to screen per clock without texels attached."

"8 black & white z-pixels per clock" rendered internally in nV3x, which is actually what nVidia claims, does not equal "8 color pixels per clock rendered to screen without texels," it seems to me. This information, coming directly from nVidia, indicates nV3x has a maximum of 4 pixel pipelines, and may not render more than 4 pixels per clock to the screen, regardless of whether there are 0,1, or 2 texels attached to those pixels. So I would label the pipeline organization of nV30/5/8 as "4x0 or 1 or 2," depending on the software demands.

R3x0, likewise, is "8x0 or 1," depending on software, and in the case of multitexturing software is able to use 4 of its pixel pipes for texel generation sans pixels, and becomes 4 (pixels)x2 (texels attached to each pixel) per clock rendered to screen. I don't see how this formula applies to nV3x, because nV3x has a ceiling of 4 pixel pipelines, and R3x0's is 8.

It seems to me that if nV4x is capable of 16x0 per clock, it must have 16 pixel pipelines. I consider 8 (pixel pipes) x2 (texel units per pipe) per clock much more likely than 16 pixel pipes. While I can't see how an unused texel unit attached to a pixel pipeline can be used for per clock, render-to-screen pixel generation, it's easy to see how a full pixel pipe may be used exclusively for texel per clock creation (since a texel is a sub unit of a final pixel, and texels are never rendered to screen independently of pixels.) IE, there's a big difference between texel units and pixel pipes, IMO.

Basically, in R3x0, multitexturing uses 4 of its total of 8 pixel pipes for the creation of 4 pixels per clock rendered to screen, but it uses all 8 of its texel units per clock, each of which is attached to a pixel pipeline. nV30/5/8 cannot do that, because they have only 4 pixel pipelines, and so can only render 4 pixels per clock to screen, whether 0, or 1, or 2 texels are attached per clock per pixel. The difference is that in single texturing, R3x0 can do 8 (pixels per clock) x1 (texel per clock per pixel), but nV30/5/8 can do only 4 (pixels per clock) x1 (texel per clock per pixel.) So I just can't see how nV30/5/8 might be accused of the "8x0" organization you mention, since that would mean they would have to be able to generate 8 pixels per clock to screen, but nV30/5/8 have only 4 pixel pipelines, so that won't work.
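The counting in the paragraphs above can be written down as a small Python sketch. It is a deliberately simplified model that only reproduces the per-clock numbers argued for here (color pixels written to screen, limited by pixel pipes and by texel units); it ignores z-only modes, shader length, bandwidth and everything else. The function name is made up, and the pipe/texel-unit counts are the ones stated in this post, not confirmed specs.

```python
def pixels_per_clock(pipes: int, tmus_per_pipe: int, textures_per_pixel: int) -> int:
    """Color pixels rendered to screen per clock in this simplified model."""
    if textures_per_pixel == 0:
        return pipes  # untextured: limited only by the number of pixel pipes
    texels_per_clock = pipes * tmus_per_pipe
    # Never more pixels than pipes, never more texels than texel units.
    return min(pipes, texels_per_clock // textures_per_pixel)


# (pipes, texel units per pipe) as described in this post
configs = {
    "nV30/5/8": (4, 2),
    "R3x0": (8, 1),
}

for name, (pipes, tmus) in configs.items():
    for textures in (0, 1, 2):
        print(name, f"{textures} texture(s):",
              pixels_per_clock(pipes, tmus, textures), "pixels/clock")
```

In this model nV30/5/8 stays at 4 pixels per clock with 0, 1 or 2 textures, while R3x0 gives 8 with 0 or 1 texture and 4 with 2, which is the distinction being drawn.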

As to "175 million" transistors having any sort of performance bearing, I can't see raw numbers as being relevant (even assuming the current rumor is correct), except peripherally to yields/heat/power/clocking considerations. Otherwise, simply reciting the raw bulk transistor count is about as meaningful, or as accurate, as declaring that because a GM engine has "more parts" than a Ford engine, it will be the faster engine. "It's not the size of the boat, but it's the motion of the ocean that counts," as the saying goes...:) Likewise, it's not the number of transistors in a chip that counts for performance--rather, it's what the transistors do, and how efficiently they do it, that makes the performance difference. IIRC, nV3x has more transistors than R3x0, and is a lot slower at many things.
 
Well I said "...in the same way that NV35 can be thought of as 4x2/8x0", i.e. if you like to think of the CineFX architecture in that way (as a lot of people do) then NV40 is 8x2/16x0 - like NV35 pixel processing double-stacked, although it's obviously not as straightforward as that. C'mon Walt, you know what I mean. :LOL: I don't think "8x0" is particularly satisfactory either - it's just a supplementary reference that's appended to describe the "z-only" mode.

WaltC said:
As to "175 million" transistors having any sort of performance bearing...

It doesn't and I agree with what you've said WRT that completely. It wasn't why I brought it up.

MuFu.
 
I see exactly what MuFu is saying, and I would have to agree with his way of looking at NV35 and what will likely be NV40's configuration.
 
Whoa, 3 edits. :?

I see "x0" as being indicative of TMU redundancy in the "zixel" mode and therefore a required inclusion. I know it's nice to think of everything in terms of "actual" pixel pipelines, but that's going to become almost impossible with DX10-level architectures - especially if you want to sum up every "modus operandi" concisely.

MuFu.

P.S. nV expect NV40 to be about 20% faster than R420 across the board in D3. Ooooooh!!!
 
Almost certainly the NV30 path - a mixture of FP16 and FX16(pad) if Uttar's info is correct. Hasn't JC already mentioned how he's using FP and FX in NV30?

MuFu.
 
MuFu said:
Almost certainly the NV30 path - a mixture of FP16 and FX16(pad) if Uttar's info is correct. Hasn't JC already mentioned how he's using FP and FX in NV30?

MuFu.

I seem to recall he has but I'm old and can't remember.
 
Doomtrooper said:
MuFu said:
P.S. nV expect NV40 to be about 20% faster than R420 across the board in D3. Ooooooh!!!

JC will ensure it is ;) , which is moot anyways with a 60 fps frame cap.
Didn't Johnny-boy say that nVidia was going to have an 80 fps frame cap and all other cards would be a 60 fps frame cap? :|





















;) j/k! (Hey, it'd be the edge nVidia needs....)
 
Heh, I forgot about the FPS cap. Well I'm sure the "The Way It's Meant To Be Benchmarked" team will sort something out.

cho, that isn't right? I haven't seen it suggested anywhere other than on forums, so you could be correct. Didn't you also say it would support FP24 though? :?

MuFu.
 