NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    That will certainly help reduce the overhead of coalescing access patterns.

    My concern is that, since L1 is not guaranteed to be coherent, there is a risk of cache pollution if a cache line is evicted from L1 to L2, gets modified there, and is then brought back into L1 again.
    Neither is 48K per SM. (Now that an SM has 32 ALUs, Fermi actually has less L1 per ALU than GT200 had shared memory per ALU.)

    There is an L2, but that is shared by all the SMs, so it works out to only 32K per SM. Not to mention that it will be slower.
     
  2. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    This is true for the peak GFLOPS rate using MAD/FMA (RV870 needs all 4 simple ALUs for a 64-bit FMA; RV770 couldn't do 64-bit FMA at all). It is noteworthy, though, that for other instructions, e.g. MUL or ADD, the rate is 2/5 of the single-precision rate, so still 544 GFLOPS when using only adds or muls, as long as the compiler can extract pairwise independent muls or adds (GF100 will drop to half its GFLOPS with muls or adds).
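    The rates above can be sanity-checked with some quick arithmetic (a sketch assuming the public Cypress figures: 320 VLIW5 units, i.e. 1600 SPs, at 850 MHz):

```python
# Back-of-the-envelope check of the RV870 (Cypress) throughput rates.
# Assumes 320 VLIW5 units (1600 SPs) at 850 MHz, FMA counted as 2 flops.
units, clk_ghz = 320, 0.85

sp_addmul = units * 5 * 1 * clk_ghz   # 1360 GFLOPS (ADD/MUL = 1 flop each)
dp_fma    = units * 1 * 2 * clk_ghz   # one DP FMA per unit per clk: 544 GFLOPS
dp_addmul = sp_addmul * 2 / 5         # 2/5 of the SP rate: also 544 GFLOPS

print(sp_addmul, dp_fma, dp_addmul)   # 1360.0 544.0 544.0
```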
     
  3. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    Technically, an SM shouldn't read or write an address that another SM is reading or writing, that is, between "sync events". So if the L1 cache has something evicted, it was either only read (then it shouldn't be written by another SM) or written (then it shouldn't be read by another SM), so there shouldn't be a risk of cache pollution. Atomic operations on global memory are likely not cached in L1.

    The write-back of the L1 cache most likely only happens when an SM's kernel completes, or when it issues a memory barrier request. These events can be broadcast, and all SMs can be forced to write back their L1 caches into the unified L2 cache or directly to memory. Therefore, coherence at the sync points is maintained.
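    This scheme can be sketched with a toy model (purely illustrative Python, not how the hardware is actually built): private per-SM write-back L1s that only flush to a shared L2 at sync events, so values become globally visible exactly at sync points:

```python
# Toy model of per-SM write-back L1 caches with sync-point coherence.
# Illustration only: real hardware uses cache lines, not single words.

class ToySM:
    def __init__(self, l2):
        self.l2 = l2          # shared L2 (a plain dict)
        self.l1 = {}          # private L1: addr -> value
        self.dirty = set()    # addresses written since the last sync

    def write(self, addr, value):
        self.l1[addr] = value          # stays in L1 until a sync event
        self.dirty.add(addr)

    def read(self, addr):
        if addr not in self.l1:        # miss: fill from L2
            self.l1[addr] = self.l2.get(addr, 0)
        return self.l1[addr]

    def sync(self):
        # broadcast sync event: write back dirty lines, invalidate L1
        for addr in self.dirty:
            self.l2[addr] = self.l1[addr]
        self.l1.clear()
        self.dirty.clear()

l2 = {}
sm0, sm1 = ToySM(l2), ToySM(l2)
sm0.write(0x100, 42)
print(sm1.read(0x100))   # 0 -- sm0's write is not yet visible
sm0.sync()
sm1.sync()               # drops sm1's stale copy
print(sm1.read(0x100))   # 42 -- visible after the sync point
```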
     
  4. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,247
    Likes Received:
    4,465
    Location:
    Finland
    So, what's the verdict on the gaming part: are there enough magic tricks and cookie monsters to really beat a chip with over 1 TFLOPS higher theoretical peak performance, and by a notable margin?
    For reference, last generation the peak difference between the 4890 and the GTX285 was ~0.2 TFLOPS.
     
  5. KonKort

    Newcomer

    Joined:
    Dec 29, 2008
    Messages:
    89
    Likes Received:
    0
    Location:
    Germany, Ennepetal
    Has somebody already looked at Fermi's packing density?
    According to Anandtech, Fermi should be 467 mm². If you compare this value with AMD's Cypress (339 mm² / 2.15 billion transistors × 3 billion transistors = 473 mm²), Nvidia's packing is as good as AMD's.

    If Fermi turns out >= 40% faster than Cypress, you can say that it has better performance per die size than Cypress. How times can change... ;)
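    The extrapolation above is just Cypress's area scaled by the transistor-count ratio (figures as quoted in the post):

```python
# Scale Cypress's die area by the transistor-count ratio
# (339 mm^2 / 2.15B transistors, Fermi ~3.0B, as quoted above).
cypress_mm2, cypress_btr = 339.0, 2.15
fermi_btr = 3.0

fermi_mm2_est = cypress_mm2 / cypress_btr * fermi_btr
print(round(fermi_mm2_est))   # 473, close to Anandtech's 467 mm^2 figure
```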
     
  6. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    It doesn't look to me like Nvidia invested a lot of transistors into INT capability. 32-bit MULs are half rate according to RWT (64-bit is 1/4 rate; I guess it's mostly needed because the chip can address more than 4GB of RAM?), and you can't issue INT and FP at the same time (on the same pipeline), so the implementation sounds pretty cheap to me. Still, the INT capability is certainly more than decent. Note that RV870 beefed up INT capability too: it can do 4 MULs/ADDs per clock, though they are 24-bit only (which of course should be very cheap to implement), plus one 32-bit MUL/ADD on the T unit (though the 32-bit MUL needs 2 instructions on the T unit if you need both the high-order and low-order result bits, at least if RV870 is like RV770 in that area).
     
  7. elsence

    Newcomer

    Joined:
    Aug 31, 2009
    Messages:
    80
    Likes Received:
    0
    Let's suppose the design is 700MHz with 48 ROPs/256 TMUs/512 SPs, and the memory is the same as the 5870's (1.2GHz GDDR5), so 230.4GB/s with a 384-bit memory controller.

    With G92b, Nvidia had a 738MHz design with 16 ROPs/64 TMUs/128 SPs and 1.1GHz GDDR3 (70.4GB/s with a 256-bit memory controller).

    GT300 then has 2.85X the pixel fillrate, 3.8X the texel fillrate and 3.3X the memory bandwidth.

    This logic is kinda simplistic, but things don't look so bad if the GDDR5 memory controller has good efficiency.
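    For what it's worth, the ratios check out (a quick sketch using the clocks and unit counts assumed above; the 512-SP configuration itself is speculation):

```python
# Quick check of the GT300-vs-G92b ratios quoted above (hypothetical
# 700MHz 48-ROP/256-TMU part vs the 738MHz 16-ROP/64-TMU G92b).
gt300 = {"mhz": 700, "rops": 48, "tmus": 256, "bw_gbs": 230.4}
g92b  = {"mhz": 738, "rops": 16, "tmus": 64,  "bw_gbs": 70.4}

pixel = (gt300["mhz"] * gt300["rops"]) / (g92b["mhz"] * g92b["rops"])
texel = (gt300["mhz"] * gt300["tmus"]) / (g92b["mhz"] * g92b["tmus"])
bw    = gt300["bw_gbs"] / g92b["bw_gbs"]

print(round(pixel, 2), round(texel, 2), round(bw, 2))   # 2.85 3.79 3.27
```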

    Nvidia must do something to reduce the hit with 8xMSAA.

    Anyway, I guess (wild guess) GT300 will be around 1.5X faster per MHz relative to a 5870 (4X AA).

    The problem with the 5870 is that the performance improvement over the 4890 is not consistent (it has much bigger variations in performance than 2X the specs would normally produce; I know about the bandwidth...).

    I don't know why that is, but I guess it is either the geometry setup engine (the geometry/vertex assembler has the same performance as the 4890's) or something about geometry shader performance.

    I could only find 3DMark Vantage tests; check:

    http://www.pcper.com/article.php?aid=783&type=expert&pid=12

    GPU Cloth: the 5870 is only 1.2X faster than the 4890. (vertex/geometry shading test)
    GPU Particles: the 5870 is only 1.2X faster than the 4890. (vertex/geometry shading test)

    Perlin Noise: the 5870 is 2.5X faster than the 4890. (math-heavy pixel shader test)
    Parallax Occlusion Mapping: the 5870 is 2.1X faster than the 4890. (complex pixel shader test)


    It shouldn't be a problem of dual-rasterizer/dual-SIMD-engine efficiency, since the synthetic pixel shader tests scale fine (more than 2X) while the synthetic geometry shading tests only reach 1.2X.

    Are these synthetic vertex/geometry shading tests so bandwidth-limited that they deliver 1.2X instead of 2X?
    Or are pixel shader tests like the Parallax Occlusion Mapping test not bandwidth-limited at all? (Why do they deliver more than 2X?)

    And anyway, it is not logical for the 5870 to be extremely bandwidth-limited.
    Why would ATI waste transistor resources like that, if the design was not going to deliver (to such a degree) because of bandwidth?
    Surely they would have used the transistor space in a more efficient way.
     
    #127 elsence, Oct 1, 2009
    Last edited by a moderator: Oct 1, 2009
  8. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Yea, why not?
    The GTX295 beats the 5870 as well, in most games... and doesn't the GTX295 have a considerably lower theoretical TFLOPS rating than the 5870 (~1.8 TFLOPS vs ~2.7 TFLOPS)? That's not even counting the inefficiencies introduced by SLI, which Fermi won't have to suffer from.
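    Those ratings follow from the public specs (a quick sketch; the GTX295 figure counts the co-issued MUL as a third flop, per NVIDIA's usual convention):

```python
# Theoretical peak single-precision throughput, in TFLOPS.
gtx295 = 2 * 240 * 1.242 * 3 / 1000   # two GPUs x 240 SPs, 1242 MHz, MAD+MUL
hd5870 = 1600 * 0.850 * 2 / 1000      # 1600 SPs, 850 MHz, FMA = 2 flops

print(round(gtx295, 2), round(hd5870, 2))   # 1.79 2.72
```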
     
  9. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Well, Anand got his 467mm² by multiplying RV870's die size by the 1.4 factor from the transistor difference between the chips, so it's not surprising you end up with the same density :).
    If GF100 is indeed only this big, I'd be pleasantly surprised. Both G92b and GT200b required about 25% more area per transistor than RV770.
     
  10. Vincent

    Newcomer

    Joined:
    May 28, 2007
    Messages:
    235
    Likes Received:
    0
    Location:
    London
    Great post and honest comment. :wink:
     
  11. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,535
    Likes Received:
    144
    [ignore]Neither can RV870 AFAIK, only MAD for DP - FMA is SP only (not 100% certain though).[/ignore]

    Seems I made something of a booboo, they actually do FMA for DP as well, so disregard the above.
     
  12. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    elsence,

    Yes, the speculative math is more than simplistic, and besides that you are comparing a performance chip (G92) with a high-end chip (GF100). If anything, G92, with its bandwidth and memory size constraints, wasn't designed in my mind for very high resolutions.

    Considering NV's 8xMSAA performance from G80 through to GT200 today, I've never really understood what the real problem is that causes such differences. 4xMSAA is single-cycle and 8xMSAA takes only two cycles. Granted, things have improved quite a bit with 8xMSAA through various driver updates for all those GPUs, but they're always a notch behind ATI's equivalents.

    Some might say 8xMSAA performance is bandwidth-related, but I don't see anything that proves that. I don't see any excessive bandwidth on Cypress, and yet the performance drop from 4x to 8xMSAA is rather small. Something else must be vastly different between the two architectures so far.

    Finally, when it comes to any hypothetical bandwidth limitations, Cypress doesn't sound much different from GF100 so far. If there's such a limit, it rather applies to all performance/high-end DX11 GPUs than to only one. Besides, it's far more important how each architecture handles its bandwidth than what the raw maximum bandwidth is on paper. Without any extensive testing results from a GF100 there's nothing we can say about that one yet, nor can we compare the two families.
     
  13. stevem

    Regular

    Joined:
    Feb 11, 2002
    Messages:
    632
    Likes Received:
    3
    All of these were (x86) cpu substitutes. Nvidia (for the time being at least) is driving complementarity in the x86/cpu space. Sure, they're looking to supplant large traditional clusters while leaving a few breadcrumbs for the cpu crowd. Heh.

    I'd say they just used the same assumptions you did & extrapolated the number... It may be bigger or smaller. ;)
     
  14. KonKort

    Newcomer

    Joined:
    Dec 29, 2008
    Messages:
    89
    Likes Received:
    0
    Location:
    Germany, Ennepetal
    You are right, but the rumours expected a similar die size. I think we can say that Fermi will be around 500 mm², and that looks like much better performance per die size than GT200.

    PS: I am now sure that the Radeon HD 5870 X2 will be faster than the GeForce 380 in gaming performance.
     
  15. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    That is 16 _scalar_ interpolations per SM, so 256 in total. But you need to interpolate 2 scalar attributes for 2D texturing, so that's only really good for 128 TMUs.

    edit: oh hmm, forgot about the higher clock of the ALUs vs. the TMUs. Still, IMHO 256 TMUs would make no sense at all; it would increase the TEX:ALU ratio considerably over GT200 (back to the G92 level).
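    The arithmetic behind this, assuming 16 SMs and 16 scalar interpolations per SM per clock as discussed (and ignoring the ALU/TMU clock difference the edit mentions):

```python
# Scalar interpolation budget vs. 2D texture coordinate demand.
sms, interp_per_sm = 16, 16
scalar_interps = sms * interp_per_sm               # 256 scalar interps/clk
attrs_per_2d_fetch = 2                             # (u, v) per 2D texture fetch
supported_tmus = scalar_interps // attrs_per_2d_fetch

print(scalar_interps, supported_tmus)   # 256 128
```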
     
    #135 mczak, Oct 1, 2009
    Last edited by a moderator: Oct 1, 2009
  16. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Wow. I had just written a fairly long post, then BAM! Blue Screen of Death. That doesn't happen too often nowadays to say the least - just my luck... A couple of thoughts based on some of the strange assumptions I've seen in this thread:

    1) Remember Fermi is the architecture, GF100 is only a chip. The derivatives should be less HPC-centric and therefore have less GPGPU 'overhead'. How much is anyone's guess; ECC support should be gone in some/all of them at the very least and it's not clear to me what level of FP64 support they will have. Also, low-end chips might have a lower ALU:TEX ratio (or even ALU:ROP) and therefore fewer transistors should be HPC or GPGPU-centric.

    2) The FP64 implementation is likely based on the two FPUs being very different; one with 24x24 multipliers and incapable of either INT32 MULs or FP64, and the other capable of either at full speed and benefiting from the first one's buses & RF when doing 64-bit. This seems quite efficient to me, but at the same time it might make it difficult to support slower FP64 in derivatives as you'd then be left without INT32 support. Unless they do it in four cycles in either ALU on those chips and forget about INT64, in which case I'm not sure why they apparently gave up on INT24 support completely... (and this would also imply the FPU-ALU differentiation is once again mostly marketing either way)

    3) TMUs will probably be fairly traditional. Just look at the die size: if we very naively assume that the TMU is the differently colored block in each shader area, then we get to about ~13% of the total die size. Given the greater performance and functionality of the SMs and the fact that more of the 'cluster logic' likely moved in there, that's not an impossible evolution from the 25% of GT200, and it's also a conservative estimate. It's interesting to ponder how NVIDIA could change the ALU:TEX ratio over time in this architecture; unlike in G8x, where the multiprocessor count per cluster was the easiest approach, it seems to me that changing the performance of the TMUs is an easier one this time around.

    4) The MC-linked group (which includes ROPs) is fairly large, so there is little reason to assume they've tried to offload as much as possible to the shader core there (although they could have for flexibility reasons; I still yearn for programmable blending). More importantly, it is therefore also possible that these blocks handle most of the triangle setup/rasterization functionality, whose performance would then scale gracefully between derivatives based on the number of MCs. Since every tile should be dedicated to an MC, it would be a sensible location from which to handle conflicts and guarantee correct rendering order. Obviously, that's not a magic fix and I still can't imagine it being easy to implement.
     
  17. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Clock for clock, SP vs SP, flop for flop, Fermi should be faster than GT200, at least from what we have seen; it's all theoretical, though. :)
     
  18. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Not really.
    The sales price of videocards or processors in general is not directly related to the manufacturing cost.
    A huge factor in product pricing is also performance.
    There are plenty of examples of 'small' chips that commanded high prices from end-users due to their better performance (e.g. Athlon 64/X2/FX vs Pentium 4/D), or 'large' chips that were cheaper than smaller ones because their performance didn't allow higher prices (e.g. Athlon X2/X3/X4/Phenom vs Core2).
     
  19. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
  20. Sxotty

    Legend

    Joined:
    Dec 11, 2002
    Messages:
    5,497
    Likes Received:
    867
    Location:
    PA USA
    The Newegg price is a much better indication of whether I should give a crap about die size.

    What I actually want to know, of all the weird things, is how tessellation performance will compare between the two, since Nvidia's is supposedly more software-oriented. That doesn't necessarily mean it will be worse, but it certainly hints that it may suffer at that particular task. I was hoping that, since tessellation is finally in DX, it would become more commonly used.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.