David Kirk of NVIDIA talks about Unified-Shader (Goto's article @ PC Watch)

Voltron said:
This is ridiculous. You are saying Kirk is wrong because the extra transistors in the 6600 went to bandwidth-saving techniques? If that were the case, then why does it outperform the NV35? A simple look at benchmarks shows that the 6600 has massively improved shaders. Of course it's architecturally different. That's my point. The NV30 architecture sucked, but 128-bit is more than sufficient when the shader performance is there.
Look at the B3D review for the 7600GT. (I know I've said this before in different threads, but too many people underestimate the value of bandwidth.)

The 7600GT has 2.6x the MADD issue rate of the 6800GS. It has a 31% higher clock, which equates to 31% higher speed in everything else due to pipeline configuration similarities between the two. It has all the improvements of the G7x architecture. Yet it performs only 15-20% faster on average.
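
For rough numbers (using the commonly quoted specs: 12 pipes at 560 MHz with 22.4 GB/s for the 7600GT, 12 pipes at 425 MHz with 32.0 GB/s for the 6800GS):

MADD issue: 2 MADDs/pipe/clock vs. 1, times 560/425 ≈ 2.6x
Everything clock-bound: 560/425 ≈ 1.31x
Memory bandwidth: 22.4/32.0 ≈ 0.7x

Every resource goes up by at least 31% except bandwidth, which drops by 30%.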

My guess is the 7600GT would be ~25% faster than it currently is if it was equipped with a 256-bit bus. Now consider the top end chips, which have over twice the horsepower of the 7600GT. I would hardly call a 128-bit bus sufficient.

Shader performance isn't everything.
 
pjbliverpool said:
Agreed, but that can apply to both Xenos and G71. What I want to know is why people are saying Xenos is so much better at vertex texturing. I'm assuming there is some technical explanation that I'm missing and that it's not just an assumption.
G71 has 4 "vertex texture units", IIRC. They only support FP32 1- and 4-channel textures with point sampling, and texturing latency has to be hidden by independent ALU instructions, at least in part.

Xenos has 16 texture units with filtering and 16 "point sampling units"; they support other, possibly more bandwidth-efficient texture formats as well, and Xenos' thread scheduling does a good job of hiding texturing latency.

But if it does have dedicated units, where are they? In the unified ALUs? A separate array like the texture samplers, and if so, how many? I remember reading that the scheduler could be programmed for dynamic branching, is this it?
It's one "flow control unit" per 16-ALU-block. There's nothing special about dedicated units for branching. You can't easily reuse that functionality for something else, so there just have to be dedicated units.
(Well, technically, in a primitive CPU architecture where the instruction pointer is just another register, the ALU can do all the branching, but that doesn't work on GPUs)
 
Gateway2 said:
I've wondered this, even if Nvidia has Xenos in the labs, could they run any software tests on it? Because X360 hasn't been cracked, and doesn't that mean only certified software can run on it?

I mean I guess they could X-ray it, but they can't run a hypothetical Nvidia 3DMark 06 for consoles on it, I don't believe.

Maybe ATI lent them a dev kit.
 
Gateway2 said:
I've wondered this, even if Nvidia has Xenos in the labs, could they run any software tests on it? Because X360 hasn't been cracked, and doesn't that mean only certified software can run on it?
Is there anything stopping them from getting a development kit? Even if it's not officially NVidia themselves, I'm sure there's a roundabout way they could do it through a sub-contractor.
 
Mintmaster said:
Look at the B3D review for the 7600GT. (I know I've said this before in different threads, but too many people underestimate the value of bandwidth.)

The 7600GT has 2.6x the MADD issue rate of the 6800GS. It has a 31% higher clock, which equates to 31% higher speed in everything else due to pipeline configuration similarities between the two. It has all the improvements of the G7x architecture. Yet it performs only 15-20% faster on average.

My guess is the 7600GT would be ~25% faster than it currently is if it was equipped with a 256-bit bus. Now consider the top end chips, which have over twice the horsepower of the 7600GT. I would hardly call a 128-bit bus sufficient.

Shader performance isn't everything.

Can't you just increase the ratio of math in RSX code though, until bandwidth doesn't matter?

Also, I've seen new benches at FiringSquad where the 7600GT outruns the 6800GT by 50%, despite less bandwidth. That is where the new MADDs come in. FS was quite impressed and noted it.
 
What’s really surprising though is the GeForce 6800 GT’s showing in our performance testing today. At one point we actually loaded up Quake 4 and ran some benchmarks to make sure our card was running correctly. NVIDIA’s GeForce 7600 GT runs circles around the GeForce 6800 GT, and in some cases the 7600 GS is able to give the 6800 GT a run for its money! We didn’t expect the 7600 GT to come close to even matching the GeForce 6800 GT in performance, but in our tests today it not only equaled the 6800 GT in performance, it outperformed it by a double-digit margin. When the GeForce 7 first launched nearly a year ago, NVIDIA was quick to boast about GeForce 7800 GTX’s enhanced pipeline efficiency, claiming that the 7800 GTX was 50% more efficient on a clock-for-clock basis than GeForce 6. Based on our results with Oblivion today, that figure certainly sounds believable.

 
pjbliverpool said:
Thanks for that, the comments about the supported texture types and comparative lack of filtering make sense. But I'm still wondering where the fast vs slow thing comes from. Was it a dev comment, a benchmark, or something technical?

In a unified design texturing is texturing: either it's fast everywhere, or it's slow everywhere. Since it's a pretty common pixel shader op, it's very likely to be fast everywhere.

On G7x a vertex texture read is at least 20 cycles, and I've measured as high as 200. Read: very slow.

It should be noted that there are negative side effects to a unified design. Vertex shaders have recently become more MIMD-like, but in a unified design, if you're SIMD, you're SIMD in both vertex and pixel shaders; this can impact the efficiency of certain operations that are more common in vertex shaders.

Xenos is pretty close in implementation to using pixel shaders to do vertex work, with everything that implies; for the most part that's pretty efficient.

I'll be interested in seeing R600 benchmarks vs G80 or whatever.
 
Gateway2, I saw that too. Mostly it's around 35% between the 7600GT and 6800GT with HDR, so per pipe per clock there's only a couple of percent difference. Furthermore, just because the 7600GT is 35% faster than a 6800GT doesn't mean it wouldn't run faster with more bandwidth. In the high-end FiringSquad review, the 7900GT outperforms the 6800GT by 100-150%, but the pipeline configuration and clock would suggest 92%.
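
For reference, here's the pipeline math behind that figure (using the stock configurations, 24 pipes at 450 MHz for the 7900GT versus 16 pipes at 350 MHz for the 6800GT):

24 x 450 / (16 x 350) = 10800 / 5600 ≈ 1.93, i.e. roughly the ~92-93% more raw per-clock throughput quoted above.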

EDIT: Why did FS say "We didn’t expect the 7600 GT to come close to even matching the GeForce 6800 GT in performance"? In every other game the 7600GT is faster than the 6800GT/GS.

Also, you're not going to make a game faster by adding math code. It's a matter of whether you can make it better looking. I've seen shaders from FEAR and FarCry posted by B3D members that were horribly inefficient. They did unnecessary work, and this probably explains why FEAR didn't look as good as it should have considering how much it taxes the hardware. Some things are best done with lots of pixels from simple shaders.
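
To make that concrete, here's a purely made-up HLSL fragment (not actual FEAR or FarCry code; SpecPowerLUT and the constants are invented) showing the kind of unnecessary work I mean, next to a leaner version of the same term:

sampler1D SpecPowerLUT;   // 1D texture with pow(x, 32) baked in

// Wasteful: if l and v are already unit length, all the inner normalizes are wasted math,
// and the full pow() is paid per pixel.
float WastefulSpecular(float3 n, float3 l, float3 v)
{
    float3 N = normalize(normalize(n));
    float3 H = normalize(normalize(l) + normalize(v));
    return pow(saturate(dot(N, H)), 32.0);
}

// Leaner version of the same term (l and v already normalized by the caller).
float LeanSpecular(float3 n, float3 l, float3 v)
{
    float3 N = normalize(n);
    float3 H = normalize(l + v);
    return tex1D(SpecPowerLUT, saturate(dot(N, H))).r;
}

Per pixel, that sort of thing adds up fast.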

Finally, note that David Kirk made that statement a long time ago. Pixel shaders were not nearly at the complexity where only the shader architecture mattered. Bandwidth was very important for fillrate in those games. 'Overkill' is a strong word for what was, and remains, a very tangible advantage.
 
ERP said:
On G7x a vertex texture read is at least 20 cycles, and I've measured as high as 200. Read: very slow.
Nice, so you've actually measured it? I'm curious as to what the transform rate is for a simple displacement map. A single channel FP32 texture, with shader essentially like this:

out.Pos = WorldViewProj * (input.Pos + texld(input.texcoord) * input.norm)

You know that as a respected member of B3D, I'm going to quote you for this. ;)
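
For the record, a fleshed-out vs_3_0 version of that one-liner might look something like this (hypothetical HLSL; DispMap and DisplacementScale are names I made up, and the point-sampled FP32 fetch is exactly the path being timed):

float4x4  WorldViewProj;
float     DisplacementScale;   // made-up scale factor
sampler2D DispMap;             // single-channel FP32 height map; G7x VTF only does point-sampled FP32

struct VS_IN
{
    float4 Pos      : POSITION;
    float3 Norm     : NORMAL;
    float2 TexCoord : TEXCOORD0;
};

float4 main(VS_IN input) : POSITION
{
    // Vertex texture fetch: vertex shaders have to use the explicit-LOD form.
    float height = tex2Dlod(DispMap, float4(input.TexCoord, 0, 0)).r;

    // Displace along the normal, then transform, as in the pseudocode above.
    float4 displaced = input.Pos + float4(input.Norm * height * DisplacementScale, 0);
    return mul(WorldViewProj, displaced);
}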
 
I thought G70/G71 could use something similar to R2VB in OGL?

If so wouldn't this help alleviate some of the latency problems with vertex texturing with the architecture?
 
Dave Baumann said:
No, Mintmaster, surely you are going to selectively quote him...:!:
Not sure what you're implying here (do I have a habit of unfairly quoting people?), but I think the whole quote is good enough to prove the merits of a US regarding VTF.
 
ERP said:
I'll be interested in seeing R600 benchmarks vs G80 or whatever.
What's an R600 or a G80?

:)

Seriously though, I'm sure a whole lot of folks have the exact same interest!

Gubbi said:
Gateway2 said:
David Kirk: It's a stupid idea. However, we will do it eventually.
And then he'll say it's the best thing ever.
But probably only if NVIDIA US hardware is faster than its fiercest competitor's.

Regarding whether US (Unified Shading, not the country) is good or not: I dunno. From a programming perspective, anything that generally makes things easier for a programmer should be applauded, but there will surely be many programming situations with tradeoffs when it comes to the benefits of uniformity versus unification.

To say that I hope the major IHVs will come up with the most pleasing US architecture solution vis-a-vis DX10 would probably be the understatement of the year for a lot of commercial programmers.
 
Xmas said:
G71 has 4 "vertex texture units", IIRC. They only support FP32 1- and 4-channel textures with point sampling, and texturing latency has to be hidden by independent ALU instructions, at least in part.

4 "vertex texture units"? AFAIK, it should be 8.
 
RobertR1 said:
Is this the same genius that said that HDR+AA is useless for now and he doesn't see any reason to implement it for a while?
Well, that was just the marketing way of saying that the transistor cost of supporting HDR+MSAA was too high and/or that the ROP capabilities of the G70/71 are inherited from the NV40 architecture.
Gateway2 said:
I've wondered this, even if Nvidia has Xenos in the labs, could they run any software tests on it? Because X360 hasn't been cracked, and doesn't that mean only certified software can run on it?
They would only need an XDK (development kit).
 
I think NVidia is banking on "performant" GS and VTF being practically irrelevant in upcoming titles. Dynamic branching hasn't even "taken off" yet. I think NVidia is betting that within the usable lifespan of the G80 (a year or so) most DX10 titles will still rely heavily on PS, and only dabble with GS/VTF rather than craft the entire engine around it. For the simple reason that supporting legacy HW/DX9L will tie developers' hands. It's like when ATI bet that SM3.0, FP32, et al. weren't important back in the R420 days, and that PS2.0 performance was where it's at.

So G80 will probably lose on synthetic tests and demos, but probably won't be disadvantaged in games before the 100% unified "G9x" arrives.
 
DemoCoder said:
I think NVidia is banking on "performant" GS and VTF being practically irrelevant in upcoming titles. Dynamic branching hasn't even "taken off" yet. I think NVidia is betting that within the usable lifespan of the G80 (a year or so) most DX10 titles will still rely heavily on PS, and only dabble with GS/VTF rather than craft the entire engine around it.

That's a pretty evident self-fulfilling prophecy though. If there's no speedy hardware of course devs are going to wait. But yeah, I agree that's what both IHVs will do at any rate and I only expect performant GS on second gen D3D10 parts.

But I was expecting VTF to have taken off by now, since there's been hardware in devs' hands for, what, two years now? And the only - commercial - example that I know of is Pacific Fighters, I think.
 
It seems obvious to me that NVIDIA plans to make use of a basic loopback mechanism, at worst at a micro-batch level. For at least 80% of the discussed points, this will be more than enough, and the cost is ludicrously below that of a true US.

Uttar
 
DemoCoder said:
I think NVidia is banking on "performant" GS and VTF being practically irrelevant in upcoming titles. Dynamic branching hasn't even "taken off" yet. I think NVidia is betting that within the usable lifespan of the G80 (a year or so) most DX10 titles will still rely heavily on PS, and only dabble with GS/VTF rather than craft the entire engine around it. For the simple reason that supporting legacy HW/DX9L will tie developers' hands. It's like when ATI bet that SM3.0, FP32, et al. weren't important back in the R420 days, and that PS2.0 performance was where it's at.

So G80 will probably lose on synthetic tests and demos, but probably won't be disadvantaged in games before the 100% unified "G9x" arrives.
I think you're absolutely right, and NVidia's prediction will come true. IMO ATI's bet was pretty sound on the things you mentioned, except they made the huge mistake of not including FP blending.

This is why unified shaders and fast DB make sense on a console, but less so on the PC, which evolves in much smaller leaps (due to the target market having older tech). It seems like DX10 will help in separating the past from the future, but only to a certain degree.

I think VTF can be very powerful for data amplification via displacement mapping, but you need a fair investment to build an engine that way, and it obviously won't work with all hardware. ATI seems to have done a very poor job of setting up devs for upcoming features. When NV40 was introduced, it seemed like ATI was discouraging dynamic branching, but they should have got the ball rolling then. Not including VTF this gen and providing R2VB instead will probably hurt them in the future, at which point it may be NVidia advocating R2VB for texture-heavy vertex programs.

The little things really tend to snowball in this industry.
 
Mordenkainen said:
That's a pretty evident self-fulfilling prophecy though. If there's no speedy hardware of course devs are going to wait. But yeah, I agree that's what both IHVs will do at any rate and I only expect performant GS on second gen D3D10 parts.

But I was expecting VTF to have taken off by now, since there's been hardware in devs' hands for, what, two years now? And the only - commercial - example that I know of is Pacific Fighters, I think.


Given the performance of VTF on G7x and the fact that no competitor's card supports it, you're not going to see it used much.
 