David Kirk of NVIDIA talks about Unified-Shader (Goto's article @ PC Watch)

Voltron said:
Meaningless or not, NVIDIA has a very efficient architecture right now.
nVidia has a very efficient Shader Model 2.0 architecture right now; the 3.0 parts (vertex texturing and dynamic branching) aren't that efficient, so it really doesn't tell you anything about G80.

As for Xenos, it doesn't tell much about a possible R600 either (OK, maybe a little more than comparing G71 to what G80 will be, but still!). For a start, the PC part won't have eDRAM, no daughter die, and so on.
 
Mintmaster said:
Umm, R420/R480 certainly wasn't lacking in performance per sq mm. Performance per sq mm per clock, sure, but that's a useless metric.

One thing you have to take into account is that Xenos could outperform G71/RSX by a factor of 10 in a dynamic branching shader or vertex texturing shader. These are the two hallmarks of SM3.0, I might add. ;)

Can I ask why it would be better in dynamic branching by a factor of 10? Doesn't Xenos only have twice the Vertex Texture units of G71?

Also, what kind of dynamic branching ability does Xenos have over G71? I.e. where do both perform this function and why so much extra on Xenos? With R520 it's obvious because it has the dedicated units, but I didn't think Xenos had that?
 
Dave Baumann said:
Chris, they are different parts built for different utilisations, on different processes and with different memories - you target for all these factors. Both NV43 and G73 use smaller processes than NV30 did, ergo they can make different design decisions based on the different costs. G73 now uses a 128-bit bus, but with 700MHz memory that's fairly common now, providing ~3GB/s more bandwidth than 9700 PRO's 256-bit bus with the 325MHz memory common at the time - and that's what Kirk was commenting against when he made the quote.


That still doesn't really change the point he is making, in my opinion. The 6600 versus the 5800 Ultra is a very unusual comparison because they use the same memory speed and clock speeds. Add in an NV35-variant chip such as the 5950 Ultra and it creates an interesting comparison. Anyone who has owned an NV35 chip will tell you that the extra bandwidth did very little good for the actual hardware (compare the 5900 XT versus the 5900 NU, with virtually identical performance - a relative difference of about 1%). Then throw in a piece of hardware like the 6600, which not only had a more efficient pipeline structure but less than half the bandwidth of the NV3x chips. It's pretty obvious to me that NV3x did not benefit much from the extra bandwidth compared to current hardware, as I feel the NV35 line's bandwidth was mostly underutilised due to its pipeline structure.

I still think his point is fair too. The 6600 GT, with a superior pipeline/pixel ALU layout, can easily outperform a piece of hardware with nearly twice the available bandwidth in most circumstances, and the extra bandwidth probably would not have been a necessity had Nvidia had a stronger-performing chip at the time.
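
A rough sketch of the peak-bandwidth arithmetic behind that comparison, in Python. The clocks are the commonly quoted reference values and an assumption on my part (retail boards varied), and this is peak bandwidth only, ignoring compression and other efficiency tricks:

```python
# Peak memory bandwidth for a DDR bus: (bus width in bytes) x (effective transfer rate).
def peak_bandwidth_gbs(bus_width_bits, mem_clock_mhz):
    return (bus_width_bits / 8) * (mem_clock_mhz * 2e6) / 1e9  # DDR = 2 transfers/clock

cards = {
    "5800 Ultra (NV30): 128-bit, 500 MHz": (128, 500),   # ~16 GB/s
    "5950 Ultra (NV35): 256-bit, 475 MHz": (256, 475),   # ~30 GB/s, nearly twice
    "6600 GT    (NV43): 128-bit, 500 MHz": (128, 500),   # ~16 GB/s, same as NV30
}
for name, (bus, clk) in cards.items():
    print(f"{name}: {peak_bandwidth_gbs(bus, clk):.1f} GB/s")
```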
 
pjbliverpool said:
Can I ask why it would be better in dynamic branching by a factor of 10? Doesn't Xenos only have twice the Vertex Texture units of G71?

Having the units doesn't mean they're any good at what they're intended for.

Also, what kind of dynamic branching ability does Xenos have over G71? I.e. where do both perform this function and why so much extra on Xenos? With R520 it's obvious because it has the dedicated units, but I didn't think Xenos had that?

Well, there was some reference in a recent presentation that mentioned it did have dedicated branch execution units. That might have been a mistake, as it also mentioned Xenos' batch size to be 48 instead of 64 pixels, and G7x's to be 100 pixels instead of something like 800 or whatever. AFAIK, pixel batch size is far more important to dynamic branching, though - it's what separates R520 from G70, R520 from R580, and so on - more so, at least, than the dedicated branching unit.
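
To put a toy number on why batch granularity matters (my own illustration, not anything from that presentation, and the per-path costs are made up): the hardware evaluates a branch per batch, so a whole batch runs the expensive path if any one of its pixels needs it. With divergence confined to thin screen-space features, coarse batches drag far more pixels through the expensive path:

```python
# Toy model: the expensive branch is only needed in 2-pixel-wide stripes every
# 64 pixels, but a whole batch (tile) pays for it if any of its pixels does.
def avg_cost(width, height, tile_w, tile_h, cheap=4, expensive=40):
    total = 0
    for ty in range(0, height, tile_h):
        for tx in range(0, width, tile_w):
            n = min(tile_w, width - tx) * min(tile_h, height - ty)
            diverged = any(x % 64 < 2 for x in range(tx, min(tx + tile_w, width)))
            total += n * (expensive if diverged else cheap)
    return total / (width * height)

for tw, th, label in ((8, 6, "~48-pixel batches (Xenos-like)"),
                      (32, 28, "~900-pixel batches (G7x-like)")):
    print(f"{label}: {avg_cost(1024, 1024, tw, th):.1f} slots/pixel on average")
```

With these made-up costs the coarse batches come out roughly 2.5x more expensive per pixel, which is in the same ballpark as the practical figures mentioned later in the thread.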
 
That still doesnt really change his point that he is making in my opinion.
As I said before, when you have smaller processes you can spend more transistors on other things, which will include bandwidth-saving techniques. "Bandwidth" is only one element of overall performance and there are multiple factors often in play, all of which will be dependent on the architecture.

Of course, comparing architectures is like comparing apples and oranges on such matters. If we take a look at the 9500 PRO which, while having a 125% bandwidth deficit to the 9700 PRO, only has an 18% core performance deficit, yet the 9700 PRO yields an average performance increase over the 9500 PRO in the game & 3DMark shader tests of 34%, peaking at 53% in game and 64% in one of the 3DMark shader tests (!), without factoring in any AA performance differences. Is that worth it or not?
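
For reference, a quick sanity check of those deficit figures; the clocks below are approximate reference values and my own assumption (retail boards varied slightly):

```python
def gbs(bus_bits, mem_mhz):                 # peak DDR bandwidth in GB/s
    return bus_bits / 8 * mem_mhz * 2e6 / 1e9

bw_9700p, bw_9500p = gbs(256, 310), gbs(128, 270)   # ~19.8 vs ~8.6 GB/s
core_9700p, core_9500p = 325, 275                   # MHz; both chips run 8 pipelines

print(f"bandwidth gap: {bw_9700p / bw_9500p - 1:.0%}")        # ~130%, the "125%" quoted
print(f"core-clock gap: {core_9700p / core_9500p - 1:.0%}")   # ~18%
# The observed 34% average gain sits well above the 18% core gap, i.e. the
# wider bus appears to be doing real work even before factoring in AA.
```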
 
TurnDragoZeroV2G said:
Having the units doesn't mean they're any good at what they're intended for.

Agreed, but that can apply to both Xenos and G71. What I want to know is why people are saying Xenos is so much better at vertex texturing; I'm assuming there is some technical explanation that I'm missing and that it's not just an assumption.


Well, there was some reference in a recent presentation that mentioned it did have dedicated branch execution units. That might have been a mistake, as it also mentioned Xenos' batch size to be 48 instead of 64 pixels, and G7x's to be 100 pixels instead of something like 800 or whatever. AFAIK, pixel batch size is far more important to dynamic branching, though - it's what separates R520 from G70, R520 from R580, and so on - more so, at least, than the dedicated branching unit.

But if it does have dedicated units, where are they? In the unified ALUs? A separate array like the texture samplers, and if so, how many? I remember reading that the scheduler could be programmed for dynamic branching - is this it?
 
Dave Baumann said:
As I said before, when you have smaller processes you can spend more transistors on other things, which will include bandwidth-saving techniques. "Bandwidth" is only one element of overall performance and there are multiple factors often in play, all of which will be dependent on the architecture.

Of course, comparing architectures is like comparing apples and oranges on such matters. If we take a look at the 9500 PRO which, while having a 125% bandwidth deficit to the 9700 PRO, only has an 18% core performance deficit, yet the 9700 PRO yields an average performance increase over the 9500 PRO in the game & 3DMark shader tests of 34%, peaking at 53% in game and 64% in one of the 3DMark shader tests (!), without factoring in any AA performance differences. Is that worth it or not?

This is ridiculous. You are saying Kirk is wrong because the extra transistors in the 6600 went to bandwidth-saving techniques? If that were the case, then why does it outperform the NV35? A simple look at benchmarks shows that the 6600 has massively improved shaders. Of course it's architecturally different. That's my point. The NV30 architecture sucked, but 128-bit is more than sufficient when the shader performance is there. Or perhaps the bandwidth-saving techniques were originally intended for NV30, but didn't make it there or were broken.
 
"...spend more transistors on other things, which will include bandwidth saving techniques. "Bandwidth" is only one element of the overall performance and there are multiple factors often in play, all of which will be dependant on the architecture."
 
Bouncing Zabaglione Bros. said:
That's fine for the manufacturer, but the end user doesn't care if his die is a little bigger and a little less efficient per square mm if it means he get a big performance hike over the competing product.

I disagree with you. As power requirements get higher for CPUs and GPUs, it is advantageous to design a chip that draws less power and requires a smaller, quieter fan to keep it at the correct operating temperature. The 7900 GT and the 7600 GT are perfect examples of this: lower power requirements and equal or better performance than competing products in their price bracket. This is not meant to be a red vs green discussion, but more of a smart design discussion.
 
pjbliverpool said:
Agreed, but that can apply to both Xenos and G71. What I want to know is why people are saying Xenos is so much better at vertex texturing; I'm assuming there is some technical explanation that I'm missing and that it's not just an assumption.

AFAIK (someone please correct me if I am wrong).

The NV4x/G7x vertex texture units are extremely limited: they support only two texture formats, no filtering (except point sampling), and are very slow.

Xenos uses the same texture mapping units for vertex and fragment data (unified), so you get the same formats supported in the PS and VS, the same filtering support, and they are very fast.

HTH
 
Dave Baumann said:
I think the easiest route for "partial" unification is to unify the vertex shaders and geometry shaders, while still keeping the pixel shaders separate - all the geometry operation types will be unified, but there will still be a separation between geometry and pixel ops.

Bingo! This is exactly the route I expect nVidia will take with their upcoming DX10 part. I also expect this partial unification to be a good compromise between performance and their G80 silicon budget. I'm still not convinced that the extra support logic needed for a fully unified shader design makes it worthwhile just yet. If ATI and nVidia had a 65nm process, maybe, but that won't be the case this year. :neutral:
 
Dave Baumann said:
"...spend more transistors on other things, which will include bandwidth saving techniques. "Bandwidth" is only one element of the overall performance and there are multiple factors often in play, all of which will be dependant on the architecture."

So in an extremely roundabout way you are saying that Kirk's statements, which I believe were made before the 9700 launched (though maybe NVIDIA had samples), were purely made as FUD? Wonder why they didn't try to stick a 256-bit bus on NV30 in the first place if he didn't believe that. So maybe he was just wrong, and they got really lucky with the NV40, rather than screwing up the NV30.
 
ants said:
AFAIK (someone please correct me if I am wrong).

The NV4x/G7x vertex texture units are extremely limited: they support only two texture formats, no filtering (except point sampling), and are very slow.

Xenos uses the same texture mapping units for vertex and fragment data (unified), so you get the same formats supported in the PS and VS, the same filtering support, and they are very fast.

HTH

Thanks for that - the comments about the supported texture types and the comparative lack of filtering make sense. But I'm still wondering where the fast vs slow thing comes from. Was it a dev comment, a benchmark, or something technical?
 
It's definitely not all black and white, that's for sure...

My original point way back when was that Kirk's statement shouldn't necessarily be dismissed as FUD, and I provided some context as to why I thought this. I think that logic is pretty compelling, actually, as far as speculation goes, which is what this is all about. It is funny how all of that has been lost in this silly little tangent, which I am not the only person engaged in.
 
Voltron said:
This is ridiculous. You are saying Kirk is wrong because the extra transistors in the 6600 went to bandwidth-saving techniques? If that were the case, then why does it outperform the NV35? A simple look at benchmarks shows that the 6600 has massively improved shaders. Of course it's architecturally different. That's my point. The NV30 architecture sucked, but 128-bit is more than sufficient when the shader performance is there. Or perhaps the bandwidth-saving techniques were originally intended for NV30, but didn't make it there or were broken.
A statement like "128-bit is more than sufficient when the shader performance is there" makes little sense. The faster the shader throughput is, the more likely you are to be bandwidth limited; hence you need more bandwidth, not less. So you could more logically state "128-bit is more than enough when your shader performance is not there".

I think that NV35 versus 6600GT is very apples-to-oranges anyway - too many architectural differences to draw many conclusions.

Balances change over time - back in the R300 timeframe the effective length of shaders was low (think UT2003-style rendering), so the effective throughput of pixels per clock was pretty high. As such a 256-bit bus on the slow memories of the time was by no means overkill for the predominant rendering techniques, as demonstrated by the large lead 9700 Pro typically had over 9500 Pro when antialiasing was applied.

Move forward to today and shaders become longer, typical pixels per clock for the same number of pipelines can decrease, and bandwidth requirements can therefore actually drop rather than increasing.
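
A back-of-envelope version of that point (all numbers illustrative, counting only colour/Z traffic and ignoring texture reads, compression and AA): with a fixed number of pipelines, a shader that takes more cycles per pixel completes fewer pixels per clock, so the raw framebuffer traffic it generates falls.

```python
def fb_traffic_gbs(pipes, core_mhz, shader_cycles, bytes_per_pixel=8):
    # pixels completed per clock, times colour+Z bytes written per pixel
    pixels_per_clock = pipes / shader_cycles
    return pixels_per_clock * bytes_per_pixel * core_mhz * 1e6 / 1e9

# 8 pipelines at ~325 MHz (an R300-class part), short vs. long shaders:
print(fb_traffic_gbs(8, 325, shader_cycles=2))    # ~10.4 GB/s - multitexture-era load
print(fb_traffic_gbs(8, 325, shader_cycles=10))   # ~2.1 GB/s  - longer SM2.0 shaders
```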

So to get back to the point, 128-bit was/is enough by what measurement? 256-bit clearly gave sufficient advantage at the time to be worthwhile at the high end - the 9700 Pro scaled pretty well from the 9500 Pro, so given a like-for-like architecture there was opportunity there. Most particularly, the 9700 was aimed to perform well with AA, and it did, but the bandwidth of the 256-bit bus was important to attaining this level of performance.

A 6600GT with 128-bit memory does pretty well against a 9700Pro with 256-bit memory at AA tests, but it should since it actually has more available bandwidth. The only way to get this level of bandwidth at that time was to double the bus width - the faster memory simply wasn't there.

Could a 6600GT go significantly faster with 256-bit memory and AA? Maybe. Does it make sense for the price/performance point you are aiming to hit? Maybe not. In the high-end at any given time the tradeoffs will be different. Saying "128-bit is enough" could just be viewed as a marketing way of saying "Uh-oh, we don't have 256-bit - My god! Look, over there, a three headed monkey!".

Going back to when the original statement about 128-bit versus 256-bit was made: if you truly believe that 128-bit is enough and that you're not bandwidth-starved compared to the competition, then you wouldn't need to clock the 128-bit memory on your high-end part so much faster than the typically available memory of the time that you get appalling yields, would you? Of course, if other factors are contributing to poor yields as well (like needing to massively overclock the core to be competitive), then you might be able to scrape enough fast memory together to go with the few parts that will clock at those speeds.
 
From other threads on this board, VTF on Nvidia HW is slow at least partly because there are normally a lot fewer vertices in flight than there are pixels (i.e. hitting external memory is likely to stall the vertex shader). A unified architecture is better able to hide memory latency since it can use pixels to cover the memory latency of VTFs.
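
A minimal sketch of that latency-hiding argument (all numbers are illustrative guesses, not real hardware specs): a unit can only keep issuing texture fetches while it has other independent work to switch to, so its fetch utilisation is roughly capped by (threads in flight) / (memory latency in cycles).

```python
# Toy latency-hiding model: utilisation of the fetch path is capped by how
# many independent threads are available to cover the memory latency.
def fetch_utilisation(threads_in_flight, latency_cycles):
    return min(1.0, threads_in_flight / latency_cycles)

LATENCY = 400  # cycles to external memory - an order-of-magnitude guess

print(f"dedicated VS array, a handful of vertices in flight: "
      f"{fetch_utilisation(16, LATENCY):.0%}")
print(f"unified array, thousands of pixels in flight: "
      f"{fetch_utilisation(4000, LATENCY):.0%}")
```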
 
pjbliverpool said:
Can I ask why it would be better in dynamic branching by a factor of 10?
Well, it just is. Check this out: http://www.gpumania.com/edicion/articulos/gpun1540.png
I know that's not Xenos, but I'm pretty sure the branching architecture is similar to the X1K series. That's what we've heard from Dave, that's what we can assume from this presentation, it makes sense when considering engineering cost, etc.

It's not a very practical example, but it shows potential. More usable techniques are usually in the neighborhood of 2-3x, but you never know what techniques will pop up in the future. Still, even 2x is plenty to shift the perf/mm2 balance towards ATI.

What I want to know is why people are saying Xenos is so much better at vertex texturing; I'm assuming there is some technical explanation that I'm missing and that it's not just an assumption.
The reason dedicated vertex shaders occupy so little die space is that they don't have to texture. If you want them to texture fast, they have to absorb latency, which means working on many vertices at once so that the time between a texture request and the received data isn't wasted. Currently, idling during this time does happen on NVidia hardware.

I unfortunately don't have data for G70/G71, but I read that NV40's vertex shaders can do 22.5 million texture fetches per second, measured. Compare that to the 4.8 billion math instructions per second that they're capable of.

The unified pipelines of Xenos allow it to potentially reach a peak of 8 billion texture fetches per second. Shading a vertex is done with the same hardware as shading a pixel. That's a factor of 350, so saying a factor of 10 is very conservative (I chose it because G70 will be better, shaders will vary, etc).
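
Spelling out the ratios from those figures (the inputs are just the numbers quoted in this post):

```python
nv40_vtf_per_s   = 22.5e6   # measured NV40 vertex texture fetches/s, as cited above
nv40_math_per_s  = 4.8e9    # NV40 vertex math instructions/s, as cited above
xenos_fetch_peak = 8e9      # Xenos peak texture fetches/s, as cited above

print(nv40_math_per_s / nv40_vtf_per_s)    # ~213 math ops per vertex fetch on NV40
print(xenos_fetch_peak / nv40_vtf_per_s)   # ~356, i.e. the "factor of 350"
```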

Also, render to vertex buffer (R2VB) will alleviate some of the pressure of vertex texturing, but there are some drawbacks, and it's less "natural". NVidia doesn't support this right now anyway on the PC side.
 
Voltron said:
The general problem with NV30 was that it sucked. NV35 with 256-bit was better, but it still sucked.

Meaningless or not, NVIDIA has a very efficient architecture right now. And xenos gives them some insight into what ATI's future products will be. It would be naive to think that NVIDIA does not have tools in the lab to make potential comparisons meaningful at least to them, even if they have imperfect information.


I've wondered this: even if Nvidia has Xenos in the lab, could they run any software tests on it? Because the X360 hasn't been cracked, and doesn't that mean only certified software can run on it?

I mean, I guess they could X-ray it, but they can't run a hypothetical Nvidia 3DMark06-for-consoles on it, I don't believe.
 