Tech-Report blasts GeForce FX

Same goes for pixel shaders. Those 1024 instructions might come in handy when doing nice refraction effects and such. It's entirely possible to consistently use such long shaders on, say, 10% of the screen, and more reasonable-length shaders for the rest.

You mean while keeping an interactive (30 fps or more) framerate, right?
 
The shader lengths that are possible depend on how fast the chip can execute shaders. I think the interesting distinction between the DX9 cards will not be what the maximum shader length is, but what is the longest average shader length that allows 30 or 60 FPS at 1024x768 with a reasonable level of overdraw.
 
Well, you don't have to run a 1000-instruction shader on your full-size buffer. You can do some cool effects (e.g. glare) on much smaller, perhaps floating-point, pbuffers.
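To put rough numbers on that (a quick sketch - the 256x256 pbuffer size and the 1000-instruction shader are just illustrative assumptions, not anything from a spec):

Code:
# Shader-op cost scales with the number of pixels shaded, so a long
# shader run on a small pbuffer is far cheaper than on the framebuffer.
full_buffer = 1024 * 768        # full-size buffer
glare_pbuf  = 256 * 256         # hypothetical downsampled glare pbuffer
shader_len  = 1000              # instructions
fps         = 30

ops_full = full_buffer * shader_len * fps   # ~23.6 billion ops/s
ops_pbuf = glare_pbuf  * shader_len * fps   # ~2.0 billion ops/s
print(ops_full / 1e9, ops_pbuf / 1e9)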

My belief is that from the moment register combiners, vertex programs, pixel shaders etc. were introduced, we've had a foot on the hardware ladder. The GF3, GF4, NV30 (and ATI equivalents) are, at the extreme end, really just platforms for demo effects - they keep folk like me and Humus out of trouble ;) BUT today's 1000-instruction fragment shader demos are tomorrow's in-game effects.
 
antlers4 said:
The shader lengths that are possible depend on how fast the chip can execute shaders. I think the interesting distinction between the DX9 cards will not be what the maximum shader length is, but what is the longest average shader length that allows 30 or 60 FPS at 1024x768 with a reasonable level of overdraw.

Well, you can calculate rough estimates for the pixel shaders fairly easily:
1024x768 = 786,432 pixels x 30 FPS = ~23.6 Mpixels/sec
Assume GFFX can do 2 64-bit FP pixel shader ops per clock per pipe. At 500 MHz, that's 8 billion ops/sec.
Further assume LMA III eliminates overdraw (probably only true if all rendering is front-to-back, but let's assume the best case).
8 billion ops / 23.6 million pixels = 339 instructions at 30 FPS, or 170 instructions at 60 FPS.

Calculations for a Radeon 9700 would be more difficult, since it can do anywhere from 1-3 pixel shader ops per clock cycle per pipe depending on how the shader is structured. But assuming it also averages 2 ops/clock/pipe, then at 325 MHz the limits work out to 220 instructions at 30 FPS and 110 instructions at 60 FPS.
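The same back-of-the-envelope arithmetic as a quick sketch (same assumptions as above: 8 pipes, an average of 2 FP shader ops per clock per pipe, no overdraw or other bottlenecks):

Code:
# Rough upper bound on average shader length from raw shader op throughput.
def max_shader_len(core_mhz, pipes=8, ops_per_clock=2,
                   width=1024, height=768, fps=30):
    ops_per_sec = core_mhz * 1e6 * pipes * ops_per_clock
    pixels_per_sec = width * height * fps
    return ops_per_sec / pixels_per_sec

print(max_shader_len(500, fps=30))   # GeForce FX @ 500 MHz: ~339
print(max_shader_len(500, fps=60))   # ~170
print(max_shader_len(325, fps=30))   # Radeon 9700 @ 325 MHz: ~220
print(max_shader_len(325, fps=60))   # ~110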

Of course, the practical instruction limits would probably be much lower due to the pipelines not being 100% efficient, some instructions taking more than 1 clock to execute, overdraw, memory bandwidth bottlenecks, etc. Obviously, 1k-instruction shaders aren't going to run at anything near real time on current hardware.

Vertex shader instruction limits would be a bit harder to determine, since the amount of data used per vertex could vary widely.
 
stevem said:
A1 (or A01) is the stepping of the current samples. It's a sensational accomplishment being the first re-spin & given various attendees' comments about stability. If the next spin is final, then a 125m transistor chip bedded down in 2 revisions is a great feat.

I'm not so convinced that this is as great an accomplishment as you make it out to be. AFAIK, many of the past few ATI chips have been finished within 1 or 2 revs (granted, not at .13 micron, but then I do believe that NV has had several failed, i.e. DOA, NV30 tapeouts which they are not counting...)
 
tamattack said:
sc1 said:
NV30 unbalanced. NV30 unfulfilling.. Where's justification?

NV30 unbalanced: specs are so unbalanced when measured against the 'paper' currently available to you and me! :LOL:

Most would prefer to compare the final quality and performance of the product rather than judge by theoretical quantification.

A lot of people were quite excited when Parhelia was announced; however, that excitement evaporated when the actual performance and price were determined.

I believe we'll see where the balanced product lies when we can finally make concrete comparisons. Comparing paper to a product when DX9 isn't available yet is rather silly. At least it's silly if you're trying to assert that anything is really provable at this point.
 
flf said:
Most would prefer to compare the final quality and performance of the product rather than judge by theoretical quantification.

A lot of people were quite excited when Parhelia was announced; however, that excitement evaporated when the actual performance and price were determined.

I believe we'll see where the balanced product lies when we can finally make concrete comparisons. Comparing paper to a product when DX9 isn't available yet is rather silly. At least it's silly if you're trying to assert that anything is really provable at this point.

Uhhhmmmmmm... r u talking to me? I thought it was pretty clear that I was just F*CKIN' around...

As you say, nothing is provable at this point.

But then, I've been drinking... and i might not make any sense at this point! :D
 
Basic said:
Pete:
Site was www.arstechnica.com
Undercover employee acting as an amazed neutral person was Rick Calle, Director of Marketing for ArtX. (In multiple versions.)
Counts as one of the big slimeballs of the internet to me.

Btw
Some time ago when someone counted up 5 or 6 ATI employees at Beyond3D, DaveBaumann said something that strongly suggested that there actually are a lot more in undercover mode.

There are quite a few ATI employees from different divisions at Rage3D.com as well - at least ones that identify themselves. However, I do believe Ichneumon did a check once, and there is quite a bit of overall traffic from ati.com, so there are a number of people from there who view the site without actually posting anything...

I will now go back to my regularly scheduled lurking :)
 
I think most of the ATI employees who browse Rage3D are the same ones who browse this site, such as myself.

There could be many lurkers who I am not aware of though ;)
 
andypski said:
I think most of the ATI employees who browse Rage3D are the same ones who browse this site, such as myself.

There could be many lurkers who I am not aware of though ;)

Ah, but the number of people who do that would not make up the amount of traffic I think I remember Ich mentioning; it was quite substantial. Perhaps he will mention it again...
 
tamattack said:
stevem said:
A1 (or A01) is the stepping of the current samples. It's a sensational accomplishment being the first re-spin & given various attendees' comments about stability. If the next spin is final, then a 125m transistor chip bedded down in 2 revisions is a great feat.

I'm not so convinced that this is as great an accomplishment as you make it out to be. AFAIK, many of the past few ATI chips have been finished within 1 or 2 revs (granted, not at .13 micron, but then I do believe that NV has had several failed, i.e. DOA, NV30 tapeouts which they are not counting...)

A01 is the first silicon back from the fab, not the first revision.
 
RussSchultz said:
A01 is the first silicon back from the fab, not the first revision.

Nevertheless, I am under the impression/belief that there have been several failed tapeouts which are not being taken into consideration.

Which, if true, would make this far from A01 silicon.
 
flf said:
tamattack said:
sc1 said:
NV30 unbalanced. NV30 unfulfilling.. Where's justification?

NV30 unbalanced: specs are so unbalanced when measured against the 'paper' currently available to you and me! :LOL:

Most would prefer to compare the final quality and performance of the product rather than judge by theoretical quantification.

A lot of people were quite excited when Parhelia was announced; however, that excitement evaporated when the actual performance and price were determined.

I believe we'll see where the balanced product lies when we can finally make concrete comparisons. Comparing paper to a product when DX9 isn't available yet is rather silly. At least it's silly if you're trying to assert that anything is really provable at this point.

I agree with you, but there are a few things you can do with math. For example, NV30 has a measly 32 bits of memory access per pipe per clock. You can never get 100% efficiency, plus you need to write to and possibly read from a 32-bit colour buffer, read and write a 32-bit Z-buffer, and access textures. Now I know compression makes these smaller, but not by that much. Without AA you don't get much colour compression, so the colour write alone already uses all of that bandwidth. With AA, compression only reduces the increase in bandwidth - it can't get any lower.

The GeForce4 has a very good memory interface, and you can tell by the minimal changes going to NV30, from what NVidia tell us (other than colour compression). It had around 70 bits of memory access per pipe per clock. Even so, the GeForce4 gets memory limited quite often, as witnessed both by overclocking experiments and by the GF4 MX cranking out more than half the GF4 Ti's framerate (the MX gets roughly 128 bits of memory access per pipe per clock).
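For what it's worth, here is that per-pipe figure worked out for a few cards. The bus widths and clocks below are the commonly quoted specs from memory, so treat the exact numbers as assumptions:

Code:
# Bits of memory access available per pipe per core clock.
def bits_per_pipe_per_clock(bus_bits, mem_mhz, core_mhz, pipes, ddr=True):
    return bus_bits * (2 if ddr else 1) * mem_mhz / core_mhz / pipes

print(bits_per_pipe_per_clock(128, 500, 500, 8))  # NV30:             32
print(bits_per_pipe_per_clock(128, 325, 300, 4))  # GF4 Ti4600:      ~69
print(bits_per_pipe_per_clock(128, 275, 300, 2))  # GF4 MX460:      ~117
print(bits_per_pipe_per_clock(256, 310, 325, 8))  # R300 (9700 PRO): ~61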

The eight pipes seem to be just a marketing feature, as they will never get that output on normal pixels. I think they should have done 4x2, with each pipe capable of twice the shading power, or 8x1 and skipped the leaf blower and lowered the clock. The only time 8x1 would be faster than 4x2 is when an odd number of cycles is needed (which is hardly significant) and in stencil/Z-only operations like in Doom 3, where there is no colour/texture bandwidth needed. They already have multiple Z-check units per pipe to speed up this situation, though, so those would also be quite starved.

However, as you all have mentioned, theory is one thing and reality is another, so I could be wrong. The Parhelia is different though, as it had specs that should have made it a good performer. NV30 is the other way around: it is theoretically unbalanced, and you can rarely outpace theory. I predict that 3DMark2001 single-texture fillrate will be less than 2000 Mtexels per second, a far cry from the claimed 4000 Mtexels per second, because that test uses alpha textures (>64 bits needed per pixel).
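A minimal sketch of that prediction, assuming the 128-bit, 500 MHz DDR memory implied by the 32 bits per pipe per clock above, and an alpha-blended write needing a colour read plus a colour write:

Code:
# Bandwidth-limited single-texture fillrate with alpha blending.
bandwidth_bytes = 128 / 8 * 2 * 500e6   # 128-bit DDR @ 500 MHz = 16 GB/s
bytes_per_pixel = 4 + 4                 # 32-bit colour read + 32-bit colour write
print(bandwidth_bytes / bytes_per_pixel / 1e6)  # ~2000 Mtexels/s cap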

I could go on, but I'm sure you're all bored of reading this now. It just seems to me that NVidia is reverting to their old self, like the GeForce2, which had astronomical specs but could barely outpace the Radeon, which had less than half of its pixel fillrate.
 
Mintmaster said:
I agree with you, but there are a few things you can do with math. For example, NV30 has a measly 32 bits of memory access per pipe per clock.

#1 This is no different on the R300. The available bandwidth per clock is comparable.

#2 With the shift to programmable shaders, you will no longer be outputting 1 pixel/z value per clock. Those memory accesses will be amortized over several clock cycles. A shader that takes 32-100 clock cycles to execute will have more than enough bandwidth for the pixel/z writes. Texture fetches won't dominate either IMHO.
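A quick sketch of that amortization argument. The 96 bits of framebuffer traffic per pixel is just an illustrative assumption; the 32 bits per pipe per clock is the NV30 figure quoted above:

Code:
# Framebuffer traffic per pixel gets spread over every clock the shader runs.
available_bits_per_clock = 32            # NV30 per-pipe figure from above
pixel_traffic_bits = 32 + 32 + 32        # colour write + Z read + Z write (assumed)

for shader_clocks in (1, 8, 32, 100):
    needed = pixel_traffic_bits / shader_clocks
    print(shader_clocks, needed, needed <= available_bits_per_clock)
# At 1 clock/pixel you need 96 bits/clock (bandwidth limited); at 32-100
# clocks/pixel you need 1-3 bits/clock, leaving plenty for texture fetches.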
 
DemoCoder said:
#1 This is no different on the R300. The available bandwidth per clock is comparable.

#2 With the shift to programmable shaders, you will no longer be outputting 1 pixel/z value per clock. Those memory accesses will be amortized over several clock cycles. A shader that takes 32-100 clock cycles to execute will have more than enough bandwidth for the pixel/z writes. Texture fetches won't dominate either IMHO.

#1 What do you mean? R300 has more bandwidth and a lower clock frequency, so that means more bandwidth per clock.

#2 Sure, but then you will go at 30 FPS if you are lucky. 32-100 clock fragment shaders will be rare beasts for some time yet. Trading one bottleneck (bandwidth) for another (pixel shader op execution rate) isn't solving anything.
 
DemoCoder said:
Mintmaster said:
I agree with you, but there are a few things you can do with math. For example, NV30 has a measly 32 bits of memory access per pipe per clock.

#1 This is no different on the R300. The available bandwidth per clock is comparable.

#2 With the shift to programmable shaders, you will no longer be outputting 1 pixel/z value per clock. Those memory accesses will be amortized over several clock cycles. A shader that takes 32-100 clock cycles to execute will have more than enough bandwidth for the pixel/z writes. Texture fetches won't dominate either IMHO.

#1 No - from the accumulated information R300 has nearly twice the bandwidth available per-pipe-per-clock when compared to NV30 - both have 8 pipes, with core and memory clocks closely matched, but R300's memory bus is twice as wide. Not a difficult calculation ;)

#2 Depends on the situation, but generally speaking as shaders tend to get longer you are correct. Whether texture fetches dominate is largely down to the type of filtering applied, and the pattern of access. I agree that generally you would expect a 32-100 instruction shader to be largely calculation bound.
 
#1 This is no different on the R300. The available bandwidth per clock is comparable.

It was quite interesting - at the launch they actually admitted that it didn't have enough bandwidth to output all eight pixels.

However, something struck me the other day - NV30 is pretty much the same as Radeon 9500 PRO, just with nearly twice the clockspeed on both the core and memory. People should look for comparisons between the 9500 PRO and 9700 to see how a 128-bit bus will constrain an 8 pipe card.
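A rough sketch of why that comparison is a reasonable proxy (the 9500 PRO clocks below are from memory, so treat them as assumptions): both parts feed 8 pipes from a 128-bit DDR bus, so the memory access per pipe per clock comes out almost identical.

Code:
# Memory bits per pipe per core clock - 9500 PRO vs NV30 (clocks assumed).
def bits_per_pipe_per_clock(bus_bits, mem_mhz, core_mhz, pipes=8):
    return bus_bits * 2 * mem_mhz / core_mhz / pipes   # DDR

print(bits_per_pipe_per_clock(128, 270, 275))   # Radeon 9500 PRO: ~31
print(bits_per_pipe_per_clock(128, 500, 500))   # NV30:             32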
 
andypski said:
#1 No - from the accumulated information R300 has nearly twice the bandwidth available per-pipe-per-clock when compared to NV30 - both have 8 pipes, with core and memory clocks closely matched, but R300's memory bus is twice as wide. Not a difficult calculation ;)


310 MHz memory on R300 PRO vs 500 MHz memory on NV30. 256-bit vs 128-bit. 19.8 GB/s vs 16 GB/s, or only about 24% more bandwidth. As I said, roughly comparable. These parts are only "unbalanced" if you consider the pathological single-texturing, no-pixel-shader scenario, which isn't very interesting. There is no such thing as a truly balanced card. You are either fillrate limited or bandwidth limited or T&L limited or CPU limited. No one has shipped a system where all the limits line up and happen at the same time.
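For reference, both framings follow from the same figures; a quick sketch using the numbers above (310 MHz/256-bit and 500 MHz/128-bit memory, 325 MHz and 500 MHz cores, 8 pipes each):

Code:
# Same memory figures, two framings: total bandwidth vs per-pipe-per-clock.
r300_bw = 256 / 8 * 2 * 310e6    # 256-bit DDR @ 310 MHz -> ~19.8 GB/s
nv30_bw = 128 / 8 * 2 * 500e6    # 128-bit DDR @ 500 MHz ->  16.0 GB/s

print(r300_bw / nv30_bw)                               # ~1.24x total bandwidth
print((r300_bw / 325e6 / 8) / (nv30_bw / 500e6 / 8))   # ~1.9x per pipe per clock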

And please, let's not talk about a hypothetical ATI card using 1 GHz memory yet. Let's compare currently announced products.


#2 Depends on the situation, but generally speaking as shaders tend to get longer you are correct. Whether texture fetches dominate is largely down to the type of filtering applied, and the pattern of access. I agree that generally you would expect a 32-100 instruction shader to be largely calculation bound.

Point being, the "32 bits per pipe" figure is not the limiting factor. Yes, to write out the final fragment you potentially need 2 or more clocks, and you also need some bandwidth up front for rejection, assuming no cache hits.

The shader itself, unless you are talking single textured pixels and no per pixel lighting of any sort, will execute in way more than 2 clocks. Even simple diffuse/specular shaders are going to eat 4-8 clocks, and so those memory accesses are going to be hidden by the shader's execution.

On old legacy games, in single/dual textured scenarios with no per-pixel anything, these cards already have such ridiculously high fillrates that even when they hit their bandwidth limits, they are well above 100fps at high resolutions and high AA, so it's a moot point.

So yes, Counter-Strike won't hit 4 gigapixels of fillrate on the NV30, but when the NV30 does hit its bandwidth wall, the game is already running at ridiculous rates. The pixel fillrate isn't the important thing anymore; it's the shader fillrate.

The NV30 may hit 4+ giga-shader-ops/s, which is the more important figure for DX9 cards. Unless you want to resort to the old red herring of "no DX9 games, so who cares about shader performance", but then there's not much point talking about programmability at all. That will be the resort of some: "well, I only care about old single-textured and dual-textured game performance!"

If you think programmability is the future, then obviously the speed at which you can execute programs is now the important figure. The Pentium 4 and Athlon do not have enough bandwidth to write out one register to memory per cycle, yet we do not talk about CPU bandwidth limits. As things become calculation bound, external bandwidth will be less of a determining factor in overall performance, and memory latency or bandwidth problems can be handled by pipelining and prefetching.

I mean, utilizing the "32-bit per pipe" argument, these cards are both ridiculously bandwidth limited in the 128-bit FP texture/framebuffer scenario. But the fact is, if you are using 128-bit FP buffers, you are most likely running significant shaders.
 
RoOoBo said:
#2 Sure, but then you will go at 30 FPS if you are lucky. 32-100 clock fragment shaders will be rare beasts for some time yet. Trading one bottleneck (bandwidth) for another (pixel shader op execution rate) isn't solving anything.

You will ALWAYS BE BOTTLENECKED.

However, if I am not using pixel shaders, then 1600x1200x32 with 4x FSAA and 16x aniso @ 60 fps is currently possible even with large depth complexities, so why should I want more raw pixel fillrate? Unless I am doing multipass, I don't want it.

What I want to do is more calculations per pixel. I am not doing this to "hide" bandwidth bottlenecks; I am doing it because I have simply reached the limit of what I can do quality-wise with bog-standard multitexture. The fact is, all high-quality CGI today uses programmable shading. To break the quality barrier, we need to switch to programmable shading, so more DirectX6-level fillrate isn't needed as much. What you can achieve with that extra fillrate takes a lot more work, is harder to develop for, and yields diminishing returns.


It's simply a trend that as developers strive for higher image quality, their new bottleneck is pixel shader op execution rate. Go talk to Pixar, PDI, or Weta and see if their problem is system bandwidth or floating point performance.
 