G80 rumours

Discussion in 'Pre-release GPU Speculation' started by IbaneZ, Feb 21, 2006.

Thread Status:
Not open for further replies.
  1. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Sure. But the important point here is that you don't want to need the full temp register file to store all of the vertices required for latency hiding in the texture units. Thus you make use of the pixel shader's register file.

    Ah, you're right, I was missing a little something. But you can still solve the problem by sharing with the pixel shader's register file.
     
  2. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    So now you're using the pixel shader's FIFO for storing the in flight vertex data?

    You're not far from a unified shader architecture now, except your solution goes through the trouble of putting all the data in the same place (the hardest part of a USA) while still using separate execution units.

    Doesn't make much sense to me.
     
  3. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    I'm not seeing how it would be that difficult.

    Anyway, I suppose it all depends upon how the register file is stored. The way I'm thinking of it, there would be a "local" register file within the ALUs or TEX units, a "global" register file stored somewhere else, and a queue for each unit (just pointers to locations in the global register file). The local register file would only need to store those few registers that are read from during the execution of a few instructions, as well as a pointer to the requisite position in the global register file.

    Then you have the queue for each unit. The queue in each unit is just a list of pointers to instructions that can be executed right now. Each clock, the unit in question takes the next instruction from the queue, loading the values it needs from the global register file, and placing the instruction in the pipeline (or instructions, in the case of multi-issue), with one limitation: if the small output buffer of the unit is not flushed, then the unit stops execution, not reading from the input queue.

    With the above sort of design, I'm really not seeing much of any problem. That is to say, each execution unit still has full control over its own execution. It's just the location where the data is stored that changes.
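    To make the scheme concrete, here is a toy software model of it: a shared global register file, a per-unit queue of ready instructions holding only pointers into it, and a small output buffer that stalls the unit when full. All class names, sizes, and the two-entry output buffer are my own illustrative assumptions, not anything from an actual design.

```python
from collections import deque

class GlobalRegisterFile:
    """Shared storage; units hold only pointers (indices) into it."""
    def __init__(self, size):
        self.regs = [0.0] * size
    def read(self, ptr):
        return self.regs[ptr]
    def write(self, ptr, value):
        self.regs[ptr] = value

class ExecutionUnit:
    """An ALU/TEX unit with an input queue of ready instructions and a
    small output buffer; it stops issuing when the buffer is full."""
    def __init__(self, grf, out_capacity=2):
        self.grf = grf
        self.queue = deque()       # (op, src_ptr, dst_ptr) ready to run now
        self.out_buffer = []       # results waiting to be drained
        self.out_capacity = out_capacity

    def step(self):
        # Stall: don't read the input queue until the output is drained.
        if len(self.out_buffer) >= self.out_capacity:
            return "stalled"
        if not self.queue:
            return "idle"
        op, src, dst = self.queue.popleft()
        value = op(self.grf.read(src))      # load operand from global RF
        self.out_buffer.append((dst, value))
        return "issued"

    def drain(self):
        # Flush results back to the global register file.
        for dst, value in self.out_buffer:
            self.grf.write(dst, value)
        self.out_buffer.clear()

grf = GlobalRegisterFile(16)
grf.write(0, 3.0)
unit = ExecutionUnit(grf)
unit.queue.append((lambda x: x * 2.0, 0, 1))
unit.queue.append((lambda x: x + 1.0, 0, 2))
unit.queue.append((lambda x: x - 1.0, 0, 3))

unit.step()            # issues 3.0 * 2.0
unit.step()            # issues 3.0 + 1.0
stalled = unit.step()  # output buffer full -> unit stalls, as described
unit.drain()
```

    The point the model makes is the one above: each unit still has full control over its own execution; only the location of the data changes.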
     
  4. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Also because there is no need. No foreseeable workload has a 1:1 vertex/pixel ratio. Most games are going to be PS limited, and most VS programs are short in comparison. Having 1:1 vertices in flight would be a waste. Vertices by their very nature are batch data due to their decidedly lower frequency. A single vertex can generate hundreds of pixels.

    I'm not sure "latency free" VTF is needed. Only "less latency". Unless your game is vertex limited, I don't see the benefit of striving for latency free. I could be convinced.

    Unified processing is orthogonal to thread batching. No one's claiming non-unified "saves more" or is "more efficient", only that it is "not needed" to achieve competitive performance.


    You need storage for 32 registers max, a pointer for each input (16 possible), interpolator state, screen position, predicate, Z, loop state (maybe), plus some way to identify which outstanding texture requests pertain to which fragment. If you handle, say, 10,000 pixels in flight, then even with batches of 16 that's 625 per-batch states, plus 10,000 times the per-pixel state, versus a few dozen vertices in flight, if even that. Doubling or tripling the amount of state available for vertices is much lower cost than doubling pixels, and vertices don't need latency-free VTF IMHO. My point is, the vastly lower frequency of vertex processing requires a lot less concurrent contextual state.

    I don't see how you can justify the idea that increasing the amount of vertex state is expensive vis-a-vis pixel shaders. Only if one held the view that every triangle was sub-pixel in size would this even begin to be a relevant point.


    False argument. There are many 3D features which have been comparably cheap in the past, but still left unimplemented until later chip revisions. IHVs have many reasons for leaving out features or enhancements which do not always have to do with die space. You should be well aware that in any given project, developers and engineers have a laundry list of enhancements, features, and changes they want to make, and not all of them make it into a product release, even if they are easy or relatively cheap to do, because other priorities exist, like time to market.

    I don't buy the argument that IHVs add only what is absolutely the best course of action given die space. There are lots of features that have made it into graphics cards that were frankly hardly ever used and essentially wasted space, and not always because of performance deficits, but market-mismatch. (Npatches anyone?)
     
    #64 DemoCoder, Feb 22, 2006
    Last edited by a moderator: Feb 22, 2006
  5. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    NV's HOS in NV20 would be another example. But do large IHVs nowadays opt for such risks, or do they rather implement only what is absolutely necessary?

    It seems to get even worse with D3D10.
     
  6. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    You're not understanding my point. The per-processor cost is what's important when comparing this solution to a USA, because the latter will only spend a small fraction of its time on vertices when the ratio of vertices to pixels is much less than 1:1.

    Consider a fixed ratio of 7:1 (pixels to vertices) for the sake of not giving a USA any advantage: 42 PS and 6 VS units for the traditional architecture, 48 units for the USA. For a 100-cycle texture latency, the former needs a 600-vertex cache for fast VTF. The latter's 4800-entry pixel/vertex cache would be used for vertices 12.5% of the time. In the end, the die cost is the same if you assume pixels and vertices need the same space for state data, so this aspect of the argument is irrelevant.
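    The arithmetic above can be checked directly. This is just a worked restatement of the post's own figures (the unit counts and 100-cycle latency are hypotheticals from the post, not real hardware):

```python
ratio = 7                       # pixels per vertex (assumed fixed)
ps_units, vs_units = 42, 6
usa_units = ps_units + vs_units  # 48 unified units
latency = 100                    # cycles of texture latency to hide

# Traditional: vertices needed in flight to cover VTF latency.
vert_cache = vs_units * latency      # -> 600

# Unified: one pool sized for all 48 units.
usa_cache = usa_units * latency      # -> 4800

# Fraction of USA time spent on vertices at a 7:1 ratio.
vertex_share = 1 / (ratio + 1)       # -> 0.125, i.e. 12.5%
```

    Per entry of equal size, 600 entries busy all the time and 4800 entries busy one eighth of the time cost the same die area, which is the post's point.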

    The advantage of the USA, of course, is when the load deviates one way or the other from 7:1. The traditional architecture will have both the cache and execution units of either the VS or PS idle. Right now, we don't have fast VTF, thus the register space is small in the VS and we don't care much if it sits idle.

    Not sure why you're saying a USA is orthogonal to the issue. We're talking about pixels vs. vertices here, and why it makes sense to separate the processing units versus unifying them. Of course you could do fast VTF the easy way by simply lengthening the FIFO in your vertex pipeline.

    It depends on the technique. For simple displacement mapping, you can request your fetch, then transform your point and normal, and when you have the data just displace along the normal. 6 cycle latency should keep the pipeline from stalling.
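    The fetch-then-transform ordering described above can be sketched as follows. This is only an illustration of the instruction scheduling idea in Python (the helper names are mine, and the independent transform work is what would cover the short fetch latency on real hardware; nothing here measures the 6-cycle figure):

```python
def displace_vertex(position, normal, fetch_height, transform):
    """Simple displacement mapping: issue the height fetch first, then
    do independent work (transforming point and normal) while the fetch
    is outstanding, and only then consume the result."""
    height = fetch_height(position)   # issue the texture fetch first...
    pos_t = transform(position)       # ...independent ALU work runs
    nrm_t = transform(normal)         #    while the fetch completes
    # Displace the transformed point along the transformed normal.
    return tuple(p + height * n for p, n in zip(pos_t, nrm_t))

identity = lambda v: v
out = displace_vertex((1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                      lambda _pos: 0.5, identity)
```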

    Other uses of VTF are not so forgiving:
    -You could move low frequency effects from the pixel shader to the vertex shader. One example: Store SH coefficients in a 3D texture to represent incoming light at any point in space, and then do per-vertex PRT. This saves a series of per-pixel 3D texture loads.
    -Physics in the vertex shader could need multiple dependent accesses
    -Techniques like this: Dynamic Ambient Occlusion and Indirect Lighting (not my favourite technique, but an example nonetheless)
    -This caustics algorithm, and any other raytracing type things for vertices

    There are plenty of other possibilities. Remember that we're just getting our feet wet with VTF right now. A few years ago I made a similar mistake of short-sightedness, thinking the cost of dynamic branching (i.e. lower processing density) outweighed the benefits, especially with a stencil buffer there to serve us. Now I'm pretty sure I was wrong.

    Of course, you could make the case for R2VB, but that's another debate.

    I guarantee you that no current GPU can give you latency free texturing (which is bandwidth efficient, naturally) when there are truly 32 live registers after compiler optimization. NV40/G70 start (gracefully) dropping in speed after using only a couple AFAIK. The facility for 32 registers is just there for flexibility. If you have lots of math or don't need texture results immediately then you don't need as many pixels in flight.

    I don't see why you need a pointer for each input. One primitive pointer to the post transform cache is enough AFAICS, so I'd like a little more explanation please.

    Screen position, loop, predicate, and request pointer are small potatoes (<10 bytes?). Per sample Z should be done afterwards using the primitive pointer, because there's no need to calculate it beforehand and store it. Top of the pipe Z-reject is per quad, as anything more detailed is pointless.

    I guess we can't settle this without having HDL for modern processors, but IMHO pixel state information is at worst comparable to vertex state info. The latter can reach 172 bytes with 10 iterators and position, if I'm not mistaken, and right now there's no cost for using all the iterators.
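    The 172-byte figure works out if one assumes 4-byte floats, ten four-component iterators, and a 3-component position (the component counts are my assumptions for the breakdown, chosen because they make the stated total come out exactly):

```python
FLOAT_BYTES = 4

iterators = 10 * 4 * FLOAT_BYTES   # 10 float4 iterators -> 160 bytes
position = 3 * FLOAT_BYTES         # x, y, z             ->  12 bytes
vertex_state = iterators + position  # -> 172 bytes
```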

    See the first point in this post.

    Okay, fair enough. That point of mine is not very strong. I still think NVidia could have used this feature to their advantage if it was fast, and there would be more than just one game using the feature.
     
    #66 Mintmaster, Feb 23, 2006
    Last edited by a moderator: Feb 23, 2006
  7. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    As a baseline you should assume that anyone from any company who is allowed to talk to the press is doing it solely for PR reasons, having been fully briefed and coached by PR, and is available solely at the PR department's wishes unless continuously proven otherwise (which won't happen, because they'll be fired long before then).

    Unless Nvidia is unlike every other company out there (unlikely), someone in Kirk's position is primarily doing PR and management, and very little day-to-day or mid- to long-term architecture.

    Aaron Spink
    speaking for myself inc.
     
    #67 aaronspink, Feb 23, 2006
    Last edited by a moderator: Feb 23, 2006
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Chalnoth, if you have full speed texturing then you need a queue to absorb the latency entirely. Once you have this queue, there's no need for any more, because all other instructions have less latency. Of course the texture unit must keep track of all the requests, but that's an independent system with relatively small storage requirements.

    If you put this queue in the texture unit, then you couldn't hide extra latency (say from bandwidth restrictions or incoherent access) with additional ALU instructions, because your pixel processors can't access the data. Xmas knows what he's talking about, and it makes sense to keep the queue with the pixel processors.
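    The sizing argument above is essentially Little's law: entries in flight = issue rate x latency, so a queue deep enough for the worst texture latency automatically covers every shorter latency. A sketch with assumed figures (none of these numbers come from a real GPU):

```python
tex_latency = 200   # worst-case texture latency in cycles (assumed)
alu_latency = 8     # longest ALU op latency in cycles (assumed)
issue_rate = 1      # pixels issued per clock per pipeline (assumed)

# One queue sized for the texture unit's latency...
queue_depth = issue_rate * tex_latency   # -> 200 entries

# ...is trivially deep enough for anything with shorter latency,
# so no second queue is needed for the ALUs.
covers_alu = queue_depth >= issue_rate * alu_latency
```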

    I didn't say your method is "difficult", I just said it'll need just about all the routing a USA needs, but you're not getting the load-sharing benefits. You bring all your vertex data to a place where it can be operated on by the powerful and plentiful pixel shading units, but all they're allowed to do is load texture data.
     
  9. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    But you escape the problem of load balancing entirely this way, which is apparently what nVidia is worried about. And besides, there can be some benefits in going "half way" as it allows you to do better research on how to go all the way.

    Regardless, it is fairly probable that nVidia won't change vertex texturing much for the next architecture. It's more likely that improvements to the pixel shader will be beneficial.
     
  10. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Really? That's surprising.

    This I'll agree with. My original comment in this thread was that keeping the VS and PS separate makes sense if you don't care about VTF. I don't think they really should care that much from a practical point of view (though a developer point of view is very different).
     
  11. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    Er... I think saying that he's just a PR prop and that he doesn't get involved with architecture is a little extreme. You don't need a PhD from Caltech to do that.
     
  12. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,723
    Likes Received:
    242
    from reading some of the posts about G80 and G90, the current thinking is something like:

    G80 = NV50 (or a revised NV5X) has more decoupling going on, but not a unified shader architecture.

    G90 = NV55 (or a revised NV5X refresh) still not a full USA

    then, NV60 (call it G100 if you will) a full USA


    p.s.

    I hope that NV50~G80 has 12 Vertex Shader 4.0 units; having 10 VS wouldn't be much of a leap.

    Adding 2 vertex shaders each generation gives a smaller and smaller relative increase in raw geometry performance (i.e. NV20 (1 VS) to NV25 (2 VS) was a big step, but going from 6 VS to 8 is not).
     
    #72 Megadrive1988, Mar 12, 2006
    Last edited by a moderator: Mar 12, 2006
  13. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,723
    Likes Received:
    242

    true, agreed. G71 is still NV4X, whereas G80 is NV5X, thus a real new generation.


    G71 is still based on technology that came out in 2004, and thus on an architecture that was designed in the early part of this decade, well before NV30/GeForce FX came out.
     
    #73 Megadrive1988, Mar 12, 2006
    Last edited by a moderator: Mar 12, 2006
  14. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
    http://today.reuters.com/business/newsArticle.aspx?type=technology&storyID=nN15213119

    "Prototype" by the end of the year, and they've already said G80 will be out before the end of the year. So I'm leaning towards sticking a fork in the possibility (slim anyway, in my estimation) that G80 is 65nm.
     
  15. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    A prototype might be produced 6-12 months before production, though.
     
  16. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,244
    Likes Received:
    3,408
    G80 will probably be on 90nm or 80nm.
     
  17. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,723
    Likes Received:
    242
    so.....

    initial G80 on 90nm or 80nm later this year, then a 'G81' speedbump on 65nm in early 2007 ?
     
  18. EasyRaider

    Regular

    Joined:
    Oct 1, 2002
    Messages:
    431
    Likes Received:
    2
    Location:
    Norway
    Carmack stated that being able to cache results between passes (for shadow map rendering) would improve performance more than doubling the amount of vertex units. As such, I think 8 will be enough, at least if NV also reduces vertex texture latency by a lot.
     
  19. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    D3D10 also requires geometry shading.
     
  20. SugarCoat

    Veteran

    Joined:
    Jul 17, 2005
    Messages:
    2,091
    Likes Received:
    52
    Location:
    State of Illusionism

    I don't think we'll see any 65nm graphics cores until mid-07 at the earliest. ATI and Nvidia both seem content to ride through most of this year on 90nm, let alone the half node. At 65nm things get very complex; the same goes for all shrinks, but the smaller, the harder. TSMC doesn't even have a 65nm fab yet, AFAIK. And major chip firm AMD has yet to make the leap as well, and will be releasing its new platform chips continuing on 90nm through most of this year (just to point out it's not a walk in the park). People seem to jump the gun too much when it comes to GPU fab size.
     
    #80 SugarCoat, Mar 19, 2006
    Last edited by a moderator: Mar 19, 2006
