David Kirk of NVIDIA talks about Unified-Shader (Goto's article @ PC Watch)

Discussion in 'Architecture and Products' started by one, Apr 19, 2006.

  1. NocturnDragon

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    393
    Likes Received:
    17
    nVidia has a very efficient Shader Model 2.0 architecture right now; the 3.0 parts (vertex texturing and dynamic branching) aren't that efficient, so it really doesn't tell you anything about G80.

    As for Xenos, it doesn't tell much either about a possible R600 (OK, maybe a little more than comparing G71 to what the G80 will be, but still!). For a start, the PC part won't have eDRAM, no daughter die, and so on.
     
  2. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
    Can I ask why it would be better in dynamic branching by a factor of 10? Doesn't Xenos only have twice the Vertex Texture units of G71?

    Also, what kind of dynamic branching ability does Xenos have over G71? I.e. where do both perform this function, and why so much extra on Xenos? With R520 it's obvious because it has the dedicated units, but I didn't think Xenos had that?
     
  3. ChrisRay

    ChrisRay R.I.P. 1983-
    Veteran

    Joined:
    Nov 25, 2002
    Messages:
    2,234
    Likes Received:
    26

    That still doesn't really change the point he is making, in my opinion. The 6600 versus the 5800 Ultra is a very unique comparison because the two use the same memory and clock speeds. Add in an NV35-variant chip such as the 5950 Ultra and it creates an interesting comparison. Anyone who has owned an NV35 chip will tell you that the extra bandwidth did very little good for the actual hardware (compare the 5900XT versus the 5900 NU: virtually identical performance properties, a relative 1% difference). Then throw in a piece of hardware like the 6600, which not only had a more efficient pipeline structure but less than half the bandwidth of the NV3x chips. It's pretty obvious to me that the NV3x did not benefit much from the extra bandwidth in comparison to current hardware, as I feel the NV35 line's bandwidth was mostly underutilized due to its pipeline structure.

    I still think his point is fair too. The 6600GT, with a superior pipeline/pixel ALU layout, can easily outperform a piece of hardware with nearly twice the available bandwidth in most circumstances, and the extra bandwidth probably would not have been a necessity had Nvidia had a stronger-performing chip at the time.
     
  4. TurnDragoZeroV2G

    Regular

    Joined:
    Nov 14, 2005
    Messages:
    583
    Likes Received:
    23
    Location:
    Who knows...
    Having the units doesn't mean they're any good at what they're intended for.

    Well, there was some reference in a recent presentation that mentioned it did have dedicated branch execution units. That might have been a mistake, though, as the same presentation put Xenos' batch size at 48 pixels instead of 64, and G7x at around 100 pixels instead of something like 800 or whatever. AFAIK, pixel batch size is far more important to dynamic branching than the dedicated branching unit is. It's what separates R520 from G70, and R520 from R580, and so on.
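The batch-size argument can be illustrated with a toy simulation. This is a sketch under made-up assumptions (random coherent runs of branching pixels standing in for screen-space locality; none of the constants are measured hardware behaviour): the smaller the batch, the more likely all its pixels agree on the branch, so the less often both paths must be executed.

```python
import random

def divergent_batch_fraction(batch_size, coherence=40, n_pixels=1 << 16, seed=0):
    """Fraction of pixel batches that must execute BOTH sides of a branch.

    Crude model: pixels take the branch in coherent runs of `coherence`
    pixels (a stand-in for screen-space locality); a batch diverges if it
    contains a mix of taken and not-taken pixels."""
    rng = random.Random(seed)
    taken = []
    while len(taken) < n_pixels:
        taken.extend([rng.random() < 0.5] * coherence)
    taken = taken[:n_pixels]

    n_batches = n_pixels // batch_size
    diverged = sum(
        1 for b in range(n_batches)
        if 0 < sum(taken[b * batch_size:(b + 1) * batch_size]) < batch_size
    )
    return diverged / n_batches

# Roughly R520-, Xenos/R580- and G7x-scale batch sizes:
for size in (16, 64, 1024):
    print(size, round(divergent_batch_fraction(size), 2))
```

With large batches nearly every batch straddles a branch boundary and pays for both paths, while small batches frequently stay coherent, which is exactly why batch size matters more than any dedicated branch unit.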
     
  5. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,455
    Likes Received:
    471
    Exactly
     
  6. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    As I said before, when you have smaller processes you can spend more transistors on other things, which will include bandwidth saving techniques. "Bandwidth" is only one element of the overall performance and there are multiple factors often in play, all of which will be dependent on the architecture.

    Of course, comparing architectures is like comparing apples and oranges on such matters. Take the 9500 PRO: the 9700 PRO has some 125% more bandwidth but only an 18% core performance advantage, yet it yields an average performance increase over the 9500 PRO in the game & 3DMark shader tests of 34%, peaking at 53% in game and 64% in one of the 3DMark shader tests (!), without factoring in any AA performance differences. Is that worth it or not?
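The bandwidth gap is easy to sanity-check. A quick sketch, using the commonly quoted memory clocks for these boards (my assumption, not stated in the post): peak bandwidth is just bus width in bytes times the effective double-pumped transfer rate.

```python
def bandwidth_gb_s(bus_bits, mem_mhz, pumps=2):
    """Peak memory bandwidth: bus width in bytes x effective transfer rate."""
    return bus_bits / 8 * mem_mhz * pumps * 1e6 / 1e9

r9700 = bandwidth_gb_s(256, 310)  # 256-bit bus, 310 MHz DDR (assumed clock)
r9500 = bandwidth_gb_s(128, 270)  # 128-bit bus, 270 MHz DDR (assumed clock)
print(f"9700 PRO {r9700:.1f} GB/s, 9500 PRO {r9500:.1f} GB/s, "
      f"advantage {(r9700 / r9500 - 1) * 100:.0f}%")
```

That lands near the ~125% figure quoted above; the exact percentage depends on which clocks you assume.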
     
  7. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
    Agreed, but that can apply to both Xenos and G71. What I want to know is why people are saying Xenos is so much better at vertex texturing. I'm assuming there is some technical explanation that I'm missing and that it's not just an assumption.


    But if it does have dedicated units, where are they? In the unified ALUs? A separate array like the texture samplers, and if so, how many? I remember reading that the scheduler could be programmed for dynamic branching; is this it?
     
  8. Voltron

    Newcomer

    Joined:
    May 25, 2004
    Messages:
    192
    Likes Received:
    3
    This is ridiculous. You are saying Kirk is wrong because the extra transistors in the 6600 went to bandwidth saving techniques? If that were the case, then why does it outperform the NV35? A simple look at benchmarks shows that the 6600 has massively improved shaders. Of course it's architecturally different; that's my point. The NV30 architecture sucked, but 128-bit is more than sufficient when the shader performance is there. Or perhaps the bandwidth saving techniques were originally intended for NV30, but didn't make it there or were broken.
     
  9. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    "...spend more transistors on other things, which will include bandwidth saving techniques. "Bandwidth" is only one element of the overall performance and there are multiple factors often in play, all of which will be dependent on the architecture."
     
  10. ondaedg

    Regular

    Joined:
    Oct 5, 2003
    Messages:
    350
    Likes Received:
    1
    I disagree with you. As power requirements get higher for CPUs and GPUs, it is advantageous to design a chip that draws less power and requires a smaller, quieter fan to keep it at the correct operating temperature. The 7900 GT and the 7600 GT are perfect examples of this: lower power requirements and equal or better performance than competing products in their price bracket. This is not meant to be a red vs. green discussion, but more of a smart-design discussion.
     
  11. ants

    Newcomer

    Joined:
    Feb 10, 2006
    Messages:
    44
    Likes Received:
    3
    AFAIK (someone please correct me if I am wrong).

    The NV4x/G7x vertex texturing units are extremely limited: they support only two texture formats and no filtering (except point sampling), and they are very slow.

    Xenos uses the same texture mapping units for vertex and fragment data (unified), so you get the same formats supported in PS and VS, the same filtering support, and they are very fast.

    HTH
     
  12. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,262
    Likes Received:
    22
    Location:
    Land of the 25% VAT
    Bingo! This is exactly the route I expect nVidia will take with their upcoming DX10 part. I also expect this partial unification to be a good compromise between performance and their G80 silicon budget. I'm still not convinced that the extra support logic needed for a fully unified shader architecture makes it worthwhile just yet. If ATI and nVidia had a 65nm process, maybe, but that won't be the case this year. :neutral:
     
  13. Voltron

    Newcomer

    Joined:
    May 25, 2004
    Messages:
    192
    Likes Received:
    3
    So in an extremely roundabout way you are saying that Kirk's statements, which I believe were made before the 9700 launched (though maybe NVIDIA had samples), were purely FUD? I wonder why they didn't try to stick a 256-bit bus on NV30 in the first place if he didn't believe that. So maybe he was just wrong, and they got really lucky with NV40, rather than screwing up NV30.
     
  14. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    Voltron, it's not all black and white...:smile:
     
  15. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
    Thanks for that; the comments about the supported texture types and comparative lack of filtering make sense. But I'm still wondering where the fast vs. slow claim comes from. Was it a dev comment, a benchmark, or something technical?
     
  16. Voltron

    Newcomer

    Joined:
    May 25, 2004
    Messages:
    192
    Likes Received:
    3
    It's definitely not all black and white, that's for sure...

    My original point way back when was that Kirk's statement shouldn't necessarily be dismissed as FUD, and I provided some context as to why I thought this. I think that logic is pretty compelling, actually, as far as speculation goes, which is what this is all about. It is funny how all of that has been lost in this silly little tangent, which I am not the only person engaged in.
     
  17. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    A statement like "128-bit is more than sufficient when the shader performance is there" makes little sense. The faster the shader throughput is the more likely you are to be bandwidth limited, hence you need more bandwidth, not less. So you could more logically state "128-bit is more than enough when your shader performance is not there".

    I think that NV35 versus 6600GT is very apples-to-oranges anyway - too many architectural differences to draw many conclusions.

    Balances change over time - back in the R300 timeframe the effective length of shaders was low (think UT2003-style rendering), so the effective throughput of pixels per clock was pretty high. As such a 256-bit bus on the slow memories of the time was by no means overkill for the predominant rendering techniques, as demonstrated by the large lead 9700 Pro typically had over 9500 Pro when antialiasing was applied.

    Move forward to today and shaders become longer, typical pixels per clock for the same number of pipelines can decrease, and bandwidth requirements can therefore actually drop rather than increasing.
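The shader-length effect can be put in toy numbers. A sketch with entirely hypothetical figures (the pipe count, clock, and bytes per pixel are assumptions chosen only to show the trend, not any real part's specs):

```python
def fb_bandwidth_demand_gb_s(pipes, core_mhz, shader_cycles, bytes_per_pixel=8):
    """Framebuffer traffic a GPU can generate if every pixel costs
    `shader_cycles` of ALU time and `bytes_per_pixel` of colour+Z I/O."""
    pixels_per_s = pipes * core_mhz * 1e6 / shader_cycles
    return pixels_per_s * bytes_per_pixel / 1e9

# Hypothetical 8-pipe, 400 MHz part: longer shaders slash bandwidth demand.
for cycles in (1, 4, 16):
    print(f"{cycles:2d}-cycle shader: "
          f"{fb_bandwidth_demand_gb_s(8, 400, cycles):5.1f} GB/s")
```

A 16x longer shader cuts the pixel output rate, and hence the framebuffer bandwidth the chip can actually consume, by 16x, which is the sense in which longer shaders reduce the pressure on the bus.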

    So to get back to the point, 128-bit was/is enough by what measurement? 256-bit clearly gave sufficient advantage at the time to be worthwhile at the high end - 9700 Pro scaled pretty well from 9500 Pro, so given a like:like architecture there was opportunity there. Most particularly 9700 was aimed to perform well with AA, and it did, but the bandwidth of the 256-bit bus was important to attaining this level of performance.

    A 6600GT with 128-bit memory does pretty well against a 9700Pro with 256-bit memory at AA tests, but it should since it actually has more available bandwidth. The only way to get this level of bandwidth at that time was to double the bus width - the faster memory simply wasn't there.

    Could a 6600GT go significantly faster with 256-bit memory and AA? Maybe. Does it make sense for the price/performance point you are aiming to hit? Maybe not. In the high-end at any given time the tradeoffs will be different. Saying "128-bit is enough" could just be viewed as a marketing way of saying "Uh-oh, we don't have 256-bit - My god! Look, over there, a three headed monkey!".

    Going back to when the original statement about 128-bit versus 256-bit was made, if you truly believe that 128-bit is enough and that you're not bandwidth-starved compared to the competition then you wouldn't then need to clock the 128-bit memory on your high end part so much faster than the typically available memory of the time that you get appalling yields would you? Of course, if other factors are contributing to poor yields as well (like needing to massively overclock the core to be competitive) then you might be able to scrape enough fast memory together to go with the few parts that will clock at those speeds.
     
    #57 andypski, Apr 19, 2006
    Last edited by a moderator: Apr 19, 2006
    digitalwanderer and Jawed like this.
  18. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    From other threads on this board, VTF on Nvidia hardware is slow at least partly because there are normally far fewer vertices in flight than pixels (i.e. hitting external memory is likely to stall the vertex shader). A unified architecture is better able to hide memory latency, since it can use pixels to cover the memory latency of VTFs.
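That latency-hiding argument can be sketched with numbers. The 200-cycle memory latency and 10 ALU ops between fetches below are made-up illustrative figures, not vendor data:

```python
import math

def threads_to_hide_latency(mem_latency_cycles, alu_cycles_between_fetches):
    """Minimum threads in flight so that while one thread waits on a texture
    fetch, the others supply enough ALU work to keep the unit busy."""
    return math.ceil(mem_latency_cycles / alu_cycles_between_fetches) + 1

# e.g. ~200-cycle DRAM latency, 10 ALU ops between fetches per thread:
print(threads_to_hide_latency(200, 10))
```

A unified pool juggling thousands of pixels clears that bar trivially; a vertex engine with only a handful of vertices in flight cannot, which is the claimed source of NV4x's slow vertex texturing.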
     
  19. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Well, it just is. Check this out: http://www.gpumania.com/edicion/articulos/gpun1540.png
    I know that's not Xenos, but I'm pretty sure the branching architecture is similar to the X1K series. That's what we've heard from Dave, that's what we can assume from this presentation, it makes sense when considering engineering cost, etc.

    It's not a very practical example, but it shows the potential. More usable techniques are usually in the neighborhood of 2-3x, but you never know what techniques will pop up in the future. Still, even 2x is plenty to shift the perf/mm2 balance towards ATI.

    The reason dedicated vertex shaders occupy such little die space is that they don't have to texture. If you want them to texture fast, they have to absorb latency, which means working on many vertices at once so that the time between a texture request and the received data isn't wasted. Currently, idling during this time does happen on NVidia hardware.

    I unfortunately don't have data for G70/G71, but I read that NV40's vertex shaders can do 22.5 million texture fetches per second, measured. Compare that to the 4.8 billion math instructions per second that they're capable of.

    The unified pipelines of Xenos allow it to potentially reach a peak of 8 billion texture fetches per second. Shading a vertex is done with the same hardware as shading a pixel. That's a factor of 350, so saying a factor of 10 is very conservative (I chose it because G70 will be better, shaders will vary, etc).
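The arithmetic behind that factor checks out, taking the figures quoted above at face value:

```python
# Figures quoted in the post above (NV40 measured, Xenos theoretical peak).
nv40_vtf_per_s    = 22.5e6  # vertex texture fetches per second, measured
nv40_math_per_s   = 4.8e9   # vertex math instructions per second, peak
xenos_fetch_per_s = 8e9     # unified texture fetches per second, peak

print(f"NV40 math-to-fetch ratio: {nv40_math_per_s / nv40_vtf_per_s:.0f}:1")
print(f"Xenos peak vs NV40 VTF:   {xenos_fetch_per_s / nv40_vtf_per_s:.0f}x")
```

The ratio comes out around 356x, so "a factor of 350" matches the quoted numbers, and a 10x claim is indeed conservative.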

    Also, render to vertex buffer (R2VB) will alleviate some of the pressure of vertex texturing, but there are some drawbacks, and it's less "natural". NVidia doesn't support this right now anyway on the PC side.
     
    #59 Mintmaster, Apr 19, 2006
    Last edited by a moderator: Apr 19, 2006
  20. superguy

    Banned

    Joined:
    Jan 27, 2006
    Messages:
    472
    Likes Received:
    9

    I've wondered this: even if Nvidia has Xenos in the labs, could they run any software tests on it? The X360 hasn't been cracked, and doesn't that mean only certified software can run on it?

    I mean, I guess they could X-ray it, but they can't run a hypothetical Nvidia 3DMark 06 for consoles on it, I don't believe.
     