The G92 Architecture Rumours & Speculation Thread

Discussion in 'Architecture and Products' started by Arun, Aug 8, 2007.

Thread Status:
Not open for further replies.
  1. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    Rumoured Data Points
    - Evolutionary step of the G8x architecture.
    - Supports PCI Express 2.0 (and DisplayPort?).
    - "Close to 1TFlop in 4Q07" according to NVIDIA Exec.
    - FP64 support confirmed (for Tesla only, not GeForce).
    - First iteration will be on 65nm; next ones might be 55nm.
    - Few details are actually known (or rumoured) about the arch.

    Noteworthy Internet Rumours
    "NVIDIA confirms Next-Gen close to 1TFlop in 4Q07" [Beyond3D]
    "Nvidia's next generation is G92" [Fudzilla]

    Thread Discussion Starting Points
    - Does all of this rumoured information seem reliable to you? What, if anything, sounds fishy?
    - Do you believe the ratios of the different units have been changed? How so?
    - What memory bus width and memory speeds are we expecting? And at what price?

    - G8x's ALU ratio is a fair bit lower than R6xx's - do you think this will be changed?
    - What modifications are you expecting in the ALUs? G8x ones were semi-custom, could this be fully custom?
    - The G80 derivatives got rid of 'free trilinear' - do you think this will also be the case for all G9x?
    - Do you think there will be some reuse between this and NVIDIA's future handheld architectures?

    - Outside the 3D architecture per se, do you think the video core will be refreshed? Do you expect there to be a separate NVIO again for desktops?

    Last Updated on August 9th
     
  2. max-pain

    Regular

    Joined:
    Feb 13, 2004
    Messages:
    309
    Double precision support?
    Arun: Oopsie, added to the list.
     
  3. Twinkie

    Regular

    Joined:
    Oct 22, 2006
    Messages:
    386
    SLi 2.0?

    And maybe the VP2? (the newly updated video engine found in G84/86)
     
  4. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    I read somewhere that the co-issue design is to be altered (MAD+ADD). :roll:

    Anyway, about the ALU:TEX ratio, I think NV should take the shortest cut, e.g. tuning the PLL of the SP domain to get a higher clock for the shaders, if the 65/55nm process is good enough.

    DP processing is a curious one: whether they will introduce a new monolithic double-wide ALU design, or a fused one, so that SP throughput [for games] would be doubled somehow.
     
  5. Rys

    Rys Tiled
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    3,893
    Location:
    Beyond3D HQ
    The shader clock domain isn't on a separate PLL which can run freely, rather it's a multiplier of the scheduler clock. That's likely to continue (and reminds me that I need to shoot Tony Tamasi for something he said to me on editor's day for the G80 launch!).

    SP speed being doubled assumes that DP runs at full speed for the clock, which I'm almost certain isn't the case. My guess is that DP runs at reduced rate because of datapath limits for operand fetch or the write path.

    As for the SFU and SP ALU structure changing (for the better, per clock), maybe on G92 but possibly not on other G9x implementations. I ran through some basic area/clock bits and pieces with someone earlier, so maybe this time around I'll publish a 'proper' guess at some point, to see how close the final architecture is.

    The memory bus width has been a bone of contention internally (Arun disagrees with what I think notably), but that pre-launch process has started again (we started a lot further out last time, if memory serves).

    The rest of the proposed discussion points I'll leave alone :twisted:
     
  6. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,597
    Yes. I think we will see a slightly higher alu:tex ratio.... maybe 25%+


    Not sure I'm expecting any to be honest... more of the same?

    Probably.

    Not significantly.

    I'll put it in percentages of how likely I think it to be:
    90% - 384-bit @ $500 flagship
    10% - 512-bit @ $550 flagship
     
  7. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    416
    6 TPCs @ 1.6GHz, about 9xxx 3DMarks.

    Dual chip (die?) for the "close to 1TFLOPS" card.
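
    Rough arithmetic behind that, assuming a G80-style organisation (16 SPs per TPC, MAD+MUL co-issue = 3 flops/clock - assumptions on my part, nothing confirmed):

        6 TPCs x 16 SPs x 3 flops x 1.6GHz = ~461 GFLOPS per chip
        2 chips x ~461 GFLOPS = ~922 GFLOPS

    which does land "close to 1TFLOPS".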
     
  8. Twinkie

    Regular

    Joined:
    Oct 22, 2006
    Messages:
    386
    What do you mean by TPC?

    Is G92 going to be dual MADD? :twisted:
     
  9. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,557
    I think G92 will be 6 clusters and 4 ROP partitions with FP64 support and maybe some other improvements (TAUs = TMUs, like G84/G86).

    This should result in a 200-250mm² die, which could be paired with the cheaper 256-bit PCB and sold for a very good price (~$200).

    Performance should also be very good with the clock gains from 65nm: I expect 600-700MHz (ROPs/TMUs) and 1.5-2GHz (SPs), and memory could be 1-1.2GHz GDDR3/4. So it should outdo the 8800 GTS and come close to the GTX.
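
    For reference, that memory speculation works out to the following (assuming DDR signalling on a 256-bit bus - my numbers, not part of the rumour):

        256 bit / 8 x 2 x 1.0GHz = 64.0 GB/s
        256 bit / 8 x 2 x 1.2GHz = 76.8 GB/s

    i.e. level with the 8800 GTS (64 GB/s) and short of the GTX (86.4 GB/s).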

    In the enthusiast segment I think we'll see a dual-GPU SKU, which has already been announced for Tesla, so I see no problem with it also coming to GeForce:
    http://www.pcper.com/article.php?aid=424&type=expert
    This should be the 1 TFLOP solution.
     
  10. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    Unsurprisingly, it's incredibly tempting to focus just on G92 - I'd like the discussion to be a bit more about G9x in general though, but heh! :) I guess starting that G92 SKU thread might be a good idea now...

    Anyway, something I like to ponder is whether NVIDIA wants to offload more and more to the shader core. Of course, it is hard to predict when that will happen and for which parts of the pipeline, but the obvious short/mid-term candidates are (rough per-frame budget after the list):
    - Triangle setup (~100 flops/triangle).
    - Downsampling (~2 flops/sample).
    - Blending (~20 flops/pixel + RMW).
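
    As a rough per-frame budget using those estimates - the workload numbers are mine and purely illustrative: at 60fps with 1M triangles, 2M pixels at 4xAA (8M samples) and an average of two blended layers per pixel:

        Setup:      100 flops x 1M tris    x 60 = 6.0 GFLOPS
        Downsample:   2 flops x 8M samples x 60 = 1.0 GFLOPS
        Blending:    20 flops x 4M pixels  x 60 = 4.8 GFLOPS

    That's ~12 GFLOPS against G80's ~346 GFLOPS of MAD throughput, i.e. a few percent - trivial in flops, so the hard part is the ordering/RMW, not the arithmetic.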

    There are tons of other candidates, but the above are probably the only ones that make potential sense for the G9x/R7xx generation. And heck, maybe none of that will happen.

    However, I find it very tempting to presume that triangle setup will be done in the shader core, as this could make the Z/Shadow-passes just ridiculously fast. Blending is a bit harder because of the RMW, but it's nothing astonishing either, it just requires a bit of locking at the ROP or memory controller level. And downsampling, well, it shouldn't be too hard either if optimized for properly but it's also much less important.

    I could definitely see G9x being a small iterative improvement over G80, but with a higher ALU-TEX ratio and more work being offloaded to the ALUs to improve overall scalability. The ROPs would need to be redesigned with that taken into consideration too, obviously.
     
  11. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,557
    To offload some fixed-function tasks to the shaders and invest the saved transistors in SPs would also make sense in connection with GPGPU (Tesla).

    But will NV take this step already with G92? At the moment I believe it is more of a low-risk interim solution; real changes we will maybe see in the next high-end core in H1 2008 (G90/G100?).
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
    Well attribute interpolation is already in the shader core, so why not, eh? I notice from the latest D3D10 presentations from SIGGRAPH a suggestion that setup be performed in GS - see page 59 of "2 - DX10 Pipeline.pdf". Talks about higher-order interpolation and other fancy stuff to produce higher quality results rather than relying upon the default behaviour of the D3D10 pipeline.

    Required by D3D10 already. I presume you mean AA resolve downsampling, otherwise not sure what you mean.

    I think that's the one NVidia will leave as late as possible. Though it has to be said, getting RMW in early would be a boon for CUDA programming, where it is otherwise quite a logjam as it currently stands (once PDC space runs out, Stores/Loads are uncached as far as I can tell).

    To me this is a "latency-hiding" problem - StreamOut simplifies the issue by being write-only, but appears to still incur a latency penalty for VS/GS code that uses SO, during SO writing. And SO has to complete before the buffer being written can be consumed by VS on the next pass. So, in all, SO is a similar problem, introducing latency during the write and requiring strict ordering.

    So, ahem, you could argue that blending will always be stuck with the latency/ordering problem. Maybe this is where we'll start seeing huge caches on die. Don't G80's L2 caches, one per ROP/MC partition, provide this functionality already?
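
    To make the RMW problem concrete, here's a minimal CUDA sketch - entirely my own construction, not anything NVIDIA has described - of emulating a saturating additive blend on a packed RGBA8 target with compare-and-swap:

        // Software blending as read-modify-write, using the compute 1.1
        // global-memory atomicCAS. Purely illustrative.
        #include <cuda_runtime.h>

        __device__ unsigned int blendAdd(unsigned int dst, unsigned int src)
        {
            unsigned int out = 0;
            for (int c = 0; c < 32; c += 8) {                // per 8-bit channel
                unsigned int r = ((dst >> c) & 0xFFu) + ((src >> c) & 0xFFu);
                out |= (r > 255u ? 255u : r) << c;           // saturate
            }
            return out;
        }

        __global__ void blendPixels(unsigned int *fb, const unsigned int *src, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            // The CAS loop is the 'bit of locking': retry until our
            // read-modify-write lands on an unmodified destination value.
            unsigned int old = fb[i], assumed;
            do {
                assumed = old;
                old = atomicCAS(&fb[i], assumed, blendAdd(assumed, src[i]));
            } while (old != assumed);
        }

    Note that this buys atomicity but not ordering - additive blend happens to be order-independent, but an 'over' blend isn't, which is exactly the strict-ordering problem above.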

    ATI hardware has twice the setup rate per clock of NVidia hardware, doesn't it? So, NVidia's motivation might be different from ATI's...

    Using a GS to perform triangle setup, it should be possible for you to test your assertion about z-/shadow-pass rendering on G80.

    If the Control Point Shader block is coming soon (D3D11?) - feeding into a fixed-function tessellator - might that take priority over Setup becoming programmable?

    Jawed
     
  13. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Location:
    Mountain View, CA
    G84, which NVIDIA terms Compute 1.1-compatible, has a whole array of atomic functions. I imagine G92 will be at least 1.1-compatible. Take that as you will...
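
    For instance, a minimal sketch of the sort of thing those atomics enable (standard CUDA; the kernel itself is just my illustration, not an NVIDIA sample):

        #include <cuda_runtime.h>

        // Histogram: many threads doing read-modify-write on shared bins,
        // arbitrated in hardware by the compute 1.1 global atomics.
        __global__ void histogram256(const unsigned char *data, int n,
                                     unsigned int *bins)  // bins[256], zeroed
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                atomicAdd(&bins[data[i]], 1u);
        }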
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
    Aha, so that might be all they do for a while then...

    Jawed
     
  15. Rys

    Rys Tiled
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    3,893
    Location:
    Beyond3D HQ
    Yeah, he means the AA downfilter, but I think he's wondering if it'll be exclusively performed in the shader core and nowhere else on the chip.
     
  16. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    There is no next high-end core in H1 2008. G92 is all you'll get for at least 9 months, and most likely more than that. If G9x is a low-risk incremental update, then there probably won't be a larger refresh until 4Q08.

    Well, that has been the case basically forever! :) It's hardly new to DX10 hardware (although I'm not 100% sure when on-demand interpolation was introduced).

    Triangle setup in the shader core is already done in Intel's latest IGPs, and has been described in an ATTILA paper. The primary advantage is the performance improvement for Z-only passes, including shadowmap generation... Of course, reducing the amount of fixed-function logic is also nice.
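
    For a feel of what setup actually computes, here's the standard rasterizer math in CUDA terms (my formulation, not any disclosed NVIDIA design): three edge equations plus a Z plane per triangle, a few dozen mul/adds - roughly where the ~100 flops/triangle estimate comes from.

        #include <cuda_runtime.h>

        struct TriSetup {
            float A[3], B[3], C[3];  // edge equations E(x,y) = A*x + B*y + C
            float zA, zB, zC;        // z plane: z(x,y) = zA*x + zB*y + zC
        };

        __device__ void edgeEq(float2 a, float2 b, float *A, float *B, float *C)
        {
            *A = a.y - b.y;
            *B = b.x - a.x;
            *C = a.x * b.y - b.x * a.y;
        }

        __device__ void setupTriangle(float2 p0, float2 p1, float2 p2,
                                      float z0, float z1, float z2, TriSetup *t)
        {
            edgeEq(p0, p1, &t->A[0], &t->B[0], &t->C[0]);
            edgeEq(p1, p2, &t->A[1], &t->B[1], &t->C[1]);
            edgeEq(p2, p0, &t->A[2], &t->B[2], &t->C[2]);

            // Z plane from the signed area (zero-area triangles culled before this).
            float area = (p1.x - p0.x) * (p2.y - p0.y) - (p2.x - p0.x) * (p1.y - p0.y);
            float inv  = 1.0f / area;
            t->zA = ((z1 - z0) * (p2.y - p0.y) - (z2 - z0) * (p1.y - p0.y)) * inv;
            t->zB = ((p1.x - p0.x) * (z2 - z0) - (p2.x - p0.x) * (z1 - z0)) * inv;
            t->zC = z0 - t->zA * p0.x - t->zB * p0.y;
        }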

    Well, what's required by D3D10 is that the application can downsample the AAed buffer itself if it sees fit. By default, it is fixed-function hardware doing that anyway on G80, AFAIK.
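
    As a sketch of what such an application-side (or shader-core) resolve looks like - sample layout and names are mine, purely illustrative:

        #include <cuda_runtime.h>

        // 4xAA box-filter resolve: one add per channel per sample plus a
        // final scale, i.e. the ~2 flops/sample ballpark from earlier.
        __global__ void resolve4x(const float4 *samples,  // 4 samples per pixel
                                  float4 *resolved, int numPixels)
        {
            int p = blockIdx.x * blockDim.x + threadIdx.x;
            if (p >= numPixels) return;
            float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
            for (int s = 0; s < 4; ++s) {
                float4 c = samples[4 * p + s];
                acc.x += c.x; acc.y += c.y; acc.z += c.z; acc.w += c.w;
            }
            resolved[p] = make_float4(acc.x * 0.25f, acc.y * 0.25f,
                                      acc.z * 0.25f, acc.w * 0.25f);
        }
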
    Well, the L2 caches are hardly 'huge' - it's just 128KiB, or 1/4th of the size of the register file! However, you make a good point: having the data on-die would clearly simplify RMW, as the latency would be less of an issue.

    However, eDRAM-like approaches cannot *guarantee* that the data is always on-die, unless the amount of eDRAM is just ridiculously big or you use tiling. So it most likely would just make the average latency lower - which isn't bad by itself, I'll admit.

    Well, yeah - improving the triangle setup unit *somehow* is more urgent for NVIDIA than for ATI, since it risks being more of a bottleneck. Just offloading it to the shader core seems like the logical path for them to take, but they might just improve its throughput via more traditional means.

    [QUOTE]Using a GS to perform triangle setup, it should be possible for you to test your assertion about z-/shadow-pass rendering on G80.[/QUOTE]You can't really bypass triangle setup if you want rasterization hardware to work in D3D10 - so in terms of performance measurements, this shouldn't tell us what we want to know... :(

    Given the Z performance of the G80, however, and the fact that previous-gen hardware (with the same triangle setup characteristics afaik!) was sometimes already triangle setup limited in Z/Shadow passes... Well, it should be kind of obvious that this is a very real bottleneck on G80, and less so on G84.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
    :oops: That makes sense.

    Reading about D3D10.1 changes:
    • user-programmable sample masks
    • MSAA depth-reading
    • per-render target blend mode
    implies there's a meaty chunk of new output merger functionality. Perhaps it's enough to justify cutting over to a shader-based OM?

    Presumably reading depth is the big one? But if that's only possible after a state change, then, erm, maybe not? I'm lost...

    Jawed
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
    Well it's interesting that ATI has stuck with its fixed function Shader Processor Interpolators. As far as I can tell these are "driver programmable" ALU blocks, running variable-length interpolation programs, dumping results to a block of attribute memory for the pixel shaders to consume. They aren't radically different from shader ALUs, I suppose. ATI GPUs appear to have multiple SPI blocks running in parallel (if the ALU redundancy patent application is to be believed), so I guess they scale with the desired triangle throughput, rather than merely scaling with clock rate.
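
    For what it's worth, per attribute per fragment an interpolator evaluates something like this - plane-equation form, my own sketch of the general technique, not ATI's actual SPI programs:

        #include <cuda_runtime.h>

        // Two MADs per scalar attribute channel, plus a divide by the
        // similarly-interpolated 1/w for perspective correction.
        __device__ float interpLinear(float a0, float dadx, float dady,
                                      float x, float y)
        {
            return a0 + dadx * x + dady * y;
        }

        __device__ float interpPerspective(float aw0, float dawdx, float dawdy,
                                           float w0, float dwdx, float dwdy,
                                           float x, float y)
        {
            float a_over_w   = interpLinear(aw0, dawdx, dawdy, x, y);  // a/w plane
            float one_over_w = interpLinear(w0, dwdx, dwdy, x, y);     // 1/w plane
            return a_over_w / one_over_w;
        }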

    I suppose it's a bit like the question, "do you have fixed function TAs? or do you run this as shader instructions?" Why did NVidia make the TAs in G80 fixed function?

    Interestingly, you could argue that a single batch of primitives being setup would mesh quite nicely with a post transform vertex cache - they'd both be "about the same size".

    I need to read the ATTILA stuff closely - I keep forgetting it :oops:

    I didn't put that very well :oops: I wasn't trying to imply that G80's L2 is huge - but if a huge L2 for RMW is coming, then that's where it would be. I guess.

    Yeah, in graphics "caches" tend to find the "right size" extremely readily. So an entire set of render targets will always be out of scope for L2. Anyway, some would argue that the RMW penalty of graphics is what keeps it honest, what enables it to be embarrassingly parallel.

    How much more performance do you think NVidia needs here? 10s of % will come with a clock boost. Orders of magnitude (to match the zixel rate of G80?) may be a step too far?

    I don't know how bad the mismatch is with G80. Arguably G92's mismatch would be lower anyway (if it has fewer ROPs), and if 2x G92 is the new enthusiast part, then AFR takes care of this.

    Whoops, yeah, that's going the wrong way on triangle count. Not a good idea.

    Jawed
     
  19. Rys

    Rys Tiled
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    3,893
    Location:
    Beyond3D HQ
    Being able to bind depth in a view in the shader with MSAA on, plus the per-RT blend mode, should both be renderstate changes/considerations at the application level. The latter actually seems more expensive to me at the hardware level, especially with the increase in the number of possible RTs that comes with D3D10. Might be wrong, since accessing depth while compression is on is hardly trivial.
     
  20. Rys

    Rys Tiled
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    3,893
    Location:
    Beyond3D HQ
    Why not? Modern hardware is effectively tiling anyway as a pixel batch makes its way out of the ROP. Cache increases to improve blending performance shouldn't have to be that extravagant? I don't see the OM being any more programmable than it has to be for this architecture anyway, and thus still fixed hardware.

    The front end of the chip seems to benefit more from the general idea than helping to get data out at the end.
     