NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    A film effect is plausible, but I'm not so sure you'd get the same pattern, especially the symmetry and the sharp-edged transitions.
     
  2. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Well, you also have to factor in material and height differences. The die photo in question was likely taken either at a mid-metal-layer stop point or from a de-layered die. Die photos of actual finished dies aren't that interesting because all you see is the top metal layer, which in a C4 design is basically a lot of square pads in a regular array.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    In addition to that, isn't it also the case that the PR photos are "made pretty" in Photoshop, on top of any special lighting or filtering that goes into taking the pictures?
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Ah yes, indeed, double-counting patch-edge vertices is my mistake.

    I thought you were trying to suggest that this was involved in making vertices on common edges align, but you were merely referring to another factor.

    You're assuming that the hardware can put multiple triangles into a hardware thread or even that multiple triangles can occupy a fragment-quad.

    Going back to R300, ATI's architecture is based on small hardware threads. I'm still unclear on the actual count of fragments in R300's threads, whether it's 16 or 64 or 256 etc. But compare this with NV40 which we know has a monster hardware thread allocation across all pixel shading SIMDs, running into thousands for the simplest case of minimal register allocation per fragment.

    Why, in this era, would ATI support multiple triangles per hardware thread, when hardware threads are small and when there are so few small triangles, ever?

    R520 has a hardware thread size of 16. All the later high-end GPUs have grown this as an artefact of the ALU:TEX increases, so that we now stand at 64. For games in general and for moderate amounts of tessellation, 64 is fine, because average triangle sizes are large enough to occupy a significant portion of all pixel shading hardware threads.

    But the point is, the basic architecture has been the same all along: SPI creates a thread of fragments' attributes at the rate of 1 or 2 attributes per fragment per clock, for a single triangle at a time.

    Cypress deletes SPI and so some of SPI's responsibility for controlling register allocation/initiation has been dumped on the overall thread control unit. Now LDS has to be populated with barycentrics and attributes, for on-demand interpolation by fragments.

    My theory is Cypress was due to have a multi-triangle, variable-throughput, thread allocator/LDS-populator. That, perhaps coupled with other changes to tessellation/setup/rasterisation, would have provided the tiny-triangle heft. But that was all dropped.

    Since I believe that single-pixel triangle tessellation is a strawman, I wonder why I'm here, frankly.

    Anyway, any decent adaptive tessellation routine will cull patches based on things like back-facing, the viewport and occlusion queries, so your strawman of 50M triangles before culling is irrelevant. These approaches to tessellation will make multi-million rasterised triangles per frame practical.

    Overdraw is always going to be a problem, even with a deferred renderer.

    The B3D graph shows that the longest draw call is ~28% of frame time. I don't know the frame rate at that time, but let's say it was 20fps, 50ms. Assuming 1.6 million triangles in 14ms (though that could have been 1M triangles in the same time, the article is very vague), that's 114M triangles per second coming out of TS.
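    Spelled out as a quick back-of-the-envelope check (the 50ms frame time and the 1.6M triangle count are just the assumptions above, not measured values):

    Code:
    # assumptions taken from the discussion above, not from measurements
    frame_time_s = 0.050                # assumed 20fps
    draw_call_s = 0.28 * frame_time_s   # longest draw call, ~14ms
    triangles = 1.6e6                   # could just as well be 1.0e6

    tris_per_second = triangles / draw_call_s
    print("%.0fM triangles/s out of TS" % (tris_per_second / 1e6))  # ~114M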

    On a close-up of the dragon the frame rate is about 45fps without tessellation. That implies a very substantial per-pixel workload - something like 260 cycles (2600 ALU cycles) per pixel assuming vertex load is negligible - and without knowing what proportion is full-screen passes. Anyway, with tessellation on, it doesn't require much of a drop in average fragments per triangle to kill pixel shading performance.

    One of the factors here is we don't know what Heaven's doing per frame. Their site talks about advanced cloud rendering, for example. That'd be a workload that's unaffected by tessellation.

    One of the things B3D could have done was to evaluate performance and draw call times with tessellation but no shadowing. I'm not sure how the night time portions of Heaven work, whether there's any shadowing involving tessellated geometry.

    Ah yes, it was so easy and obvious, it's a feature of Cypress :roll:

    Jawed
     
  5. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,802
    Likes Received:
    473
    Location:
    Torquay, UK
    I'm aware of that, but I was under the impression that the metal layers use some transistors as buffers and to amplify signals. Or are those in the logic layer, so the wires need to go back down to the bottom in order to use them?
    It's nice to learn new things on the forum, so thank you in advance for the answer!
     
  6. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    Still trying to make sense of the ROP throughput...

    One thing which is quite remarkable, and which I almost overlooked in Tridam's numbers, is the ZSamples/sec throughput. Now, from G80 on, nvidia chips could in theory always do 8 ZSamples/clock, though they were never really close to their peak (for non-AA it was impossible anyway due to bandwidth limits).

    But look at Tridam's numbers:
    http://www.hardware.fr/articles/787-5/dossier-nvidia-geforce-gtx-480.html

    The HD5870, which can do 4 ZSamples/clock, reaches almost its peak rate, but only at 8xAA. With no AA it is quite a bit below that, apparently due to bandwidth restrictions. The GTX285 is not anywhere close to its theoretical peak, and for some odd reason I don't understand it actually has its peak at no AA - on the upside, though, it is higher with no AA than the HD5870, suggesting that in this case (as the bandwidth is almost the same) it has a bit better z buffer compression (maybe the z buffer compression ratio simply doesn't increase with higher AA?).
    But look what happens with the GTX480. Even though the numbers are still nowhere near their theoretical peaks (if we believe nvidia and the ROPs run at 700MHz, that would be 270GSamples/s), it reaches twice the throughput of the HD5870 with no AA, and still 1.5 times more with 8xAA, with only a little more memory bandwidth than either the GTX285 or the HD5870. That suggests to me that GF100 has much improved z buffer compression compared to previous chips. It still doesn't really seem to scale with increasing AA (hard to say, though, as the numbers do at least increase a little and maybe the ROPs just can't reach their peak in any case), but with no AA it's almost twice as good as what the HD5870 has.
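    For reference, the theoretical peaks being compared (a rough sketch; the ROP counts and clocks are the commonly quoted figures, with the GF100 ROPs taken at nvidia's claimed 700MHz):

    Code:
    # chip: (ROPs, Z samples per ROP per clock, clock in GHz) - quoted figures
    chips = {
        "GTX 480": (48, 8, 0.700),
        "GTX 285": (32, 8, 0.648),
        "HD 5870": (32, 4, 0.850),
    }
    for name, (rops, z_per_clk, ghz) in chips.items():
        print("%s: %.1f GZSamples/s theoretical peak" % (name, rops * z_per_clk * ghz))
    # GTX 480: 268.8 (the ~270 figure), GTX 285: 165.9, HD 5870: 108.8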

    Now the color fill results are an entirely different affair, and most of them (if there really are 48 ROPs running at 700MHz) are simply pathetic, but it's at least something...
     
  7. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    ALL transistors of every kind are in the logic layer.
     
  8. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    It isn't unusual for them to apply various obfuscation filters to the image, especially if the die photo comes out well before release.
     
  9. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    As said, the photos are typically taken from some of the lower metal layers. The almost regular pitch between the lines there works as a diffraction grating. That means different colors can be seen at different angles (as it is illuminated with white light for the shots). An optical diffraction grating with a sub-micron pitch (several thousand lines per mm) shows fairly similar coloring (but without the structures, of course) when one looks at it with the bare eye. Or just take a CD or DVD! ;)
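    To put a rough number on it (just an illustration; the 1µm pitch is an assumed, plausible lower-metal-layer value, not taken from any particular die):

    Code:
    import math

    # grating equation: d * sin(theta) = m * lambda, here for first order (m = 1)
    pitch_nm = 1000.0  # assumed line pitch of ~1 micron
    for color, wavelength_nm in [("blue", 450.0), ("green", 550.0), ("red", 650.0)]:
        theta = math.degrees(math.asin(wavelength_nm / pitch_nm))
        print("%s: first-order angle ~%.0f deg" % (color, theta))
    # blue ~27deg, green ~33deg, red ~41deg - white light fans out into colors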
     
  10. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    That was my intuition. I just think it's cool that, unlike a CD where the patterns are smeared out (presumably due to error correction, data compression, and other distribution effects that randomize things, so most CDs have a uniform diffraction fringe effect), the chip cores actually have a lot more symmetry - in a way, almost organic. Maybe I'm the only one who sees the beauty and art in this, but I find these die shots pleasing to the eye, on both the color and geometric-pattern levels.
     
  11. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Yes, they go back to the bottom.
     
  12. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    The Z-sample rate for a 700 MHz part should translate to 179.2 GSamples/s. They said pretty clearly they could do up to 256 zpc if the data is compressible.

    http://www.pcgameshardware.com/aid,743526/Some-gory-guts-of-Geforce-GTX-470/480-explained/News/
    " For example in 8xAA, the peak GPC output rate is 32*8 = 256 samples per clk, whereas the peak ROP rate is 48 samples per clk."

    No, they aren't. Color fill is 30*700 = 21 GPixel/s for the GTX 480, for example, since it's no longer the ROPs that are limiting performance - the bottleneck is earlier in the pipe. That's a move away from standard routes, granted.
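    Spelling out those two numbers (a quick sketch, taking the stated 700MHz ROP clock and the quoted per-clock limits at face value):

    Code:
    rop_clock_ghz = 0.700  # nvidia's stated ROP clock

    # peak Z rate if the GPCs can output 32 * 8 = 256 samples per clock (8xAA)
    print("Z peak: %.1f GSamples/s" % (256 * rop_clock_ghz))   # 179.2

    # color fill for the GTX 480, limited to 30 pixels/clk earlier in the pipe
    print("Color fill: %.1f GPixel/s" % (30 * rop_clock_ghz))  # 21.0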
     
    #3912 CarstenS, Mar 30, 2010
    Last edited by a moderator: Mar 30, 2010
  13. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    And why does a GTX480 only reach 10 GPixel/s with the RGB9e5 or the FP16 color format? I don't see where this could be limited earlier in the pipeline.
     
  14. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Half rate? You cannot split individual pixels across different ROPs, and each ROP takes two cycles to process that (and, FWIW, the FP16) format.
     
  15. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    You wouldn't have to split individual pixels. A GTX480 can spit out 30 pixels per clock (at 700 MHz) to 48 ROPs. I'm quite sure that on average they get evenly distributed to the ROPs (otherwise 48 ROPs would have zero benefit compared to only 30). So if those formats really are half rate in the ROPs, that would also imply the ROPs are what limits it to about 10 GPixel/s. But half rate for 48 ROPs @ 700 MHz would yield 16.8 GPixel/s.

    That reminds me of the strange rumors in January (I think Xman was talking about that) involving a core clock of only 450 MHz or something like that. And if you look at the numbers today, it appears to fit - or is there anything speaking against it besides nvidia's own word?
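    The alternatives, spelled out (a minimal sketch; the ~10 GPixel/s value is the measured FP16/RGB9e5 fill rate under discussion, and the three candidates are the possible limits raised in this exchange):

    Code:
    half_rate = 0.5  # ROPs assumed to take two cycles for these formats

    candidates = {
        "48 ROPs @ 700MHz, half rate": 48 * half_rate * 0.700,        # 16.8
        "30 px/clk feed @ 700MHz, half rate": 30 * half_rate * 0.700, # 10.5
        "48 ROPs @ 450MHz, half rate": 48 * half_rate * 0.450,        # 10.8
    }
    for label, gpixels in candidates.items():
        print("%s: %.1f GPixel/s (measured ~10)" % (label, gpixels))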
     
    #3915 Gipsel, Mar 30, 2010
    Last edited by a moderator: Mar 30, 2010
  16. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Right, it would fit. My mistake - I was thinking of 18 ROPs running idle after a while, not taking into account that the shader engine doesn't have to be half rate.
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,428
    Likes Received:
    426
    Location:
    New York
    Is that 30 number from the interview? How do you lose 2 pixels due to the disabled SM? Rasterization is supposed to be a GPC level function. I don't get it. :???:
     
  18. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Maybe an SM can only export two pixels per clock to the ROPs?

    Btw, I still think there are no such things as GPCs ;)
     
  19. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    Yeah, we were talking about this before; I still can't make any sense of it. There are 48 ROPs, but they either act like 48 ROPs at ~450 MHz (yet nvidia claims they have a 700 MHz clock, and overclocking the core also increases ROP throughput, though a fixed divider might be a possibility) or like 32 ROPs at ~700 MHz (which is essentially the same). CarstenS is right, though, about the 256 zpc max, since even for 8xAA this is single-cycle and hence limited by the rasterization limit. And since nvidia claimed that the 32 pixel throughput limit goes down with disabled SMs, that would make it only 240 zpc - hence the measured rate would indeed be close to 100% of the theoretical rate. So maybe nvidia has killer compression for high AA levels too and it just doesn't show up in this test, but in any case the compression really looks good even with no AA.
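    Putting those two observations into numbers (a quick check; the 30 pixels/clk figure is the reduced rasterization limit with disabled SMs mentioned above):

    Code:
    # the two readings of the color fill behaviour are essentially equivalent
    print("48 ROPs * 450MHz = %.1f GPixel/s" % (48 * 0.450))   # 21.6
    print("32 ROPs * 700MHz = %.1f GPixel/s" % (32 * 0.700))   # 22.4

    # Z ceiling once the rasterization limit drops to 30 pixels/clk (8xAA)
    print("%d Z samples per clock" % (30 * 8))                 # 240, not 256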

    That's what I was trying to tell you in about a dozen posts :)
     
  20. mboeller

    Regular

    Joined:
    Feb 7, 2002
    Messages:
    922
    Likes Received:
    1
    Location:
    Germany
    Wacky idea: maybe not 450 MHz but 462 MHz. This would be 924 MHz / 2??
     