NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. MarkoIt

    Regular

    Joined:
    Mar 1, 2007
    Messages:
    392
    Likes Received:
    0

    They aren't talking about two separate ALUs for DP and SP operations in their whitepaper. I still believe it's 2 flops/clock for each core.
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    That will work. But you lose all scheduling granularity between the producer and consumer. They're now 1:1, which means the register allocation for both has to be carried by the one fused kernel.

    Though D3D11 has features for tackling resource allocation for this "uber shader" type problem, with dynamic linkage. But I think that's still static at run time, so I don't think it solves this problem.

    Jawed
     
  3. Jawed

    Legend

    Yeah I agree with this, now.

    NVidia's instruction scheduler allows these instructions to be issued out of order and irrespective of thread. So if the INT functionality was a separate SIMD (like SFU is) then there'd be no problem.

    I'm basing 16 colours/clock on HD5870's 32 colours/clock, at ~half the likely clock of Larrabee. Larrabee might be as fast as 2GHz, of course. As for Z, HD5870's 4x rate seems to be more than adequate, too. NVidia's 8x rate is clearly wasted on GT200, though a real 8x rate would be useful. Of course on Larrabee the absolute Z-rate is down to whatever else the hardware's doing.
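    The clock-scaling arithmetic behind that estimate can be sketched as follows (HD5870's ~850MHz core clock and a 2GHz Larrabee are both assumptions here, not confirmed figures):

```python
# Peak colour fillrate = pixels written per clock x clock rate.
# The 850 MHz HD5870 clock and 2 GHz Larrabee clock are assumptions.
def colour_fillrate_gpix(pixels_per_clock, clock_ghz):
    """Peak colour fillrate in gigapixels per second."""
    return pixels_per_clock * clock_ghz

hd5870 = colour_fillrate_gpix(32, 0.85)   # ~27.2 Gpix/s
larrabee = colour_fillrate_gpix(16, 2.0)  # ~32.0 Gpix/s
```

    So 16 colours/clock at ~2GHz lands in the same ballpark as 32 colours/clock at ~850MHz, which is the basis of the comparison.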

    See MfA's idea: you have an uber-kernel that does both sides on a GF100 core - or you split the kernels across the cores, some are producers and some are consumers.

    It seems GF100 doesn't support context switches per core, only per GPU.

    R600 supports up to 8 states at a time. It seemingly uses these to do multiple concurrent contexts, but the documentation is vague. AMD, according to TechReport, has claimed multiple kernels per core on R800:

    but I'm doubtful that's multiple compute kernels, merely multiple graphics kernels. i.e. VS and PS kernels can run on a single core (like they do on R600, I presume), but I'm dubious (until I see documentation that confirms otherwise) that two or more compute kernels can timeslice on a core. Maybe the 8-state support that is in R600 for graphics kernels has been extended to compute kernels.

    Jawed
     
  4. Jawed

    Legend

    SM4 requires 4096 vec4 registers per pixel, 64KB.
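    The arithmetic behind that figure, with each vec4 register holding four 32-bit components:

```python
# SM4's temporary register limit: 4096 vec4 registers per pixel,
# each register holding 4 x 32-bit (fp32) components.
regs = 4096
components = 4           # vec4
bytes_per_component = 4  # fp32

bytes_per_pixel = regs * components * bytes_per_component  # 65536 bytes = 64KB
```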
     
  5. Jawed

    Legend

    How?

    Jawed
     
  6. Jawed

    Legend

    Meanwhile JHH says:

    http://www.hardforum.com/showthread.php?t=1456146

    So, erm, bigger than GT200?

    Jawed
     
  7. Jawed

    Legend

    DP Divide on R600/R700 is 12 cycles, it's a "macro" effectively.

    Jawed
     
  8. Jawed

    Legend

    Yeah, Intel's chosen a pretty good time for Larrabee as fixed-function texture decompression/filtering is unlikely to need to progress to any great degree beyond what's in D3D11. Nothing else new seems likely to be necessarily fixed-function for decent performance.

    Jawed
     
  9. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    IIRC it's already exposed under DirectCompute. And as pointed out, some support will come in OpenCL soon as well.
     
  10. Jawed

    Legend

    At most TS can produce 32 new points per input patch. But the hardware probably can't rasterise more than one resulting triangle per clock, i.e. 1 new point per clock is the actual TS rate required. Though you can argue more if there's culling of various types to do (back-face, screen-clip).

    DS throughput is also going to be a potential bottleneck, i.e. there's a lot of work to do to convert a point into a vertex - lots of interpolations, at least.

    On GF100 at say 750MHz, and assuming it can rasterise 750M triangles per second, there'd be 1024 scalar operations per triangle (assuming ALU clock is twice core clock). So I can't see how software tessellation is going to be meaningfully constrained.
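    The 1024-scalar-ops figure follows from the whitepaper's 512 scalar ALUs plus the assumed clocks above (750MHz core, ALUs at twice that, one triangle rasterised per core clock - all of these are assumptions):

```python
# Scalar operations available per rasterised triangle on a hypothetical
# GF100 configuration: 512 ALUs, 750 MHz core clock, ALU clock at 2x the
# core clock, one triangle per core clock. Clocks are assumed, not confirmed.
alus = 512
core_clock_hz = 750e6
alu_clock_hz = 2 * core_clock_hz
tris_per_second = 750e6  # one triangle rasterised per core clock

scalar_ops_per_second = alus * alu_clock_hz
ops_per_triangle = scalar_ops_per_second / tris_per_second  # 1024.0
```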

    I don't understand why AMD implemented ALU-based interpolation (deleting SPI) but kept fixed function tessellation. The only thing I can think of is that it's a functional block that also does vertex/geometry assembly (to feed setup) and its deletion will come when the architecture is properly overhauled.

    Jawed
     
    #410 Jawed, Oct 4, 2009
    Last edited by a moderator: Oct 4, 2009
  11. Tim

    Tim
    Regular

    Joined:
    Mar 28, 2003
    Messages:
    875
    Likes Received:
    5
    Location:
    Denmark
  12. Jawed

    Legend

    It'd be great to see some justification for doing so.

    Jawed
     
  13. Jawed

    Legend

    Those terms seem much better than the transistor count-based scaling idiocy I see everywhere I look. Also, I don't see how I could make the caveat stronger.

    Jawed
     
  14. Jawed

    Legend

    Until we know what's happened with TMUs and ROPs, it's pretty murky. And the DP is definitely a monster improvement - though such a low base blunts that somewhat.

    It'll be interesting to find out how much of a response to Larrabee it was - since Larrabee's been rumoured/outlined for quite a while now.

    Jawed
     
  15. Jawed

    Legend

    Isn't it optional? Slide 32:

    http://developer.amd.com/gpu_assets/Your Game Needs Direct3D 11, So Get Started Now.pps

    Michael Chu clarified:

    Curious why DirectCompute's optional DP is getting much higher priority than OpenCL's optional DP.

    Jawed
     
  16. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,244
    Likes Received:
    3,408
    I'm sorry, but what's all this italics and underlining? I don't remember NV ever saying anything along the lines of "sooner than you'd expect", and I don't remember them acknowledging that it's late. They would want to have it right now, of course, but that doesn't mean it's late. A1, August. If that chip was A3 and was done in spring, then you would have a reason to say it's late. Right now it's on track. Whether that track is late in itself is another issue.

    So you're now trusting Fuad again? -)

    I don't know whether this number will be enough or not. I'm just saying that TMUs are necessary for compute as well, not just for graphics.
    I don't understand the stance that Fermi is good for compute and bad for graphics. Most of what's needed for compute is needed for graphics too. If anything, Fermi should be better for graphics than the previous generation architecture.
     
  17. elsence

    Newcomer

    Joined:
    Aug 31, 2009
    Messages:
    80
    Likes Received:
    0
    I have a question.
    I see that there is a lot of speculation on the net regarding what performance Fermi-based designs are going to have in games (relative to the old GT200 or the DX11 5870).

    I guess the logical thing for Nvidia is to have only one Fermi design for both the Tesla market and the gaming market (cost/time/resources-related issues).

    The potential Tesla TAM, according to NV, will be something like $1.2 billion over the next 18 months.
    NV's total revenue over 18 months is close to $5 billion now (in the recent past it was $6 billion or more).

    If Tesla accounted for less than 1.3% of NVIDIA's total revenue last quarter, and that is indicative of all quarters, then Tesla revenue over 18 months was something like $65 million.

    So what I am asking is this:

    Is it impossible for NV to have two designs, if NV thinks Tesla revenue will increase fivefold, for example?
    ($325 million, a little more than 25% of the potential Tesla TAM)
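    The arithmetic above can be checked directly (all figures are the poster's estimates, not official numbers):

```python
# Estimated Tesla revenue versus NV's claimed TAM, using the figures
# quoted in the post (estimates, not official numbers).
total_revenue_18mo = 5.0e9  # ~$5 billion over 18 months
tesla_share = 0.013         # Tesla at <1.3% of total revenue

tesla_revenue = total_revenue_18mo * tesla_share  # ~$65 million
fivefold = 5 * tesla_revenue                      # $325 million
tesla_tam = 1.2e9                                 # NV's claimed 18-month TAM
share_of_tam = fivefold / tesla_tam               # ~0.27, "a little more than 25%"
```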

    Also, does anyone know if, and by what percentage, there is going to be a performance hit from the ECC implementation? (ECC helps the scientific sector, but I don't think it matters in gaming; even if GDDR5 leads to errors at a higher rate than before, I suspect this isn't an issue for gaming applications.)

    I am also worried about FP64 performance.
    Why should the gaming part dedicate this much transistor space to FP64 performance?
    Isn't it more logical for Nvidia to use that transistor space in a more efficient way for the gaming sector?
    The one certain thing imo is that at least the DX11 value parts are not going to have these features and FP64 ratios.
     
    #417 elsence, Oct 4, 2009
    Last edited by a moderator: Oct 4, 2009
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Now tell me whether that statement makes sense given the numbers below. Here I used relatively conservative clocks of 650/1300 for Fermi and the currently rumoured 128 TMUs. I counted only MAD flops; adjust as required if you consider the "missing MUL" useful.

    [image: table of comparative throughput numbers]

    Does that look like they've abandoned graphics, keeping in mind the lowball clock estimates?
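    The attached table didn't survive the thread archive, but the kind of numbers it compared can be reconstructed from the post's assumptions. The GTX 285 reference figures (240 ALUs at 1476MHz, 80 TMUs at 648MHz) are from memory and may be slightly off:

```python
# MAD-only flops and bilinear texel rates under the post's assumptions:
# Fermi at 650 MHz core / 1300 MHz hot clock, 512 ALUs, 128 TMUs.
# GTX 285 figures (240 ALUs @ 1476 MHz, 80 TMUs @ 648 MHz) are from
# memory and may be slightly off.
def mad_gflops(alus, hot_clock_ghz):
    return alus * 2 * hot_clock_ghz   # MAD = 2 flops per ALU per clock

def gtexels_per_s(tmus, core_clock_ghz):
    return tmus * core_clock_ghz      # one bilinear texel per TMU per clock

fermi_gflops = mad_gflops(512, 1.3)       # ~1331 GFLOPS
gtx285_gflops = mad_gflops(240, 1.476)    # ~708 GFLOPS
fermi_texels = gtexels_per_s(128, 0.65)   # ~83.2 Gtexels/s
gtx285_texels = gtexels_per_s(80, 0.648)  # ~51.8 Gtexels/s
```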
     
  19. trinibwoy

    trinibwoy Meh
    Legend

    That's far riskier and more expensive than their current approach. People hype the bigger die sizes but only Nvidia knows how much that hurts the bottom line in the end. Also much of this is financed by "cheap" dies like G92 and lower where they are very competitive. In the end the big dies on the high end may not be as big a deal as commonly thought and it's a far easier proposition for them to leverage that investment in multiple markets.
     
  20. elsence

    Newcomer

    Yes, I agree.

    That's why i wrote:

    "I guess the logical thing for Nvidia is to have only one Fermi design for both the Tesla market and the gaming market (cost/time/resources-related issues)."

    I just like surprises.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.