NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
:oops: I messed it up there.
     
  2. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
This is a special case, though, as you have monstrous register use. Since you don't have enough threads to hide the latency of a fetch immediately followed by an ALU clause that uses the data, the ordering of the fetches is very important.

    Number of threads has to do with number of registers per hw thread (i.e. registers per fragment * fragments per hw thread). It doesn't matter what the ALU:TEX ratio of the SIMD engine is. If the hw thread size grows to 128 fragments, then I agree (only for programs with extremely high register use, though), but I doubt ATI is going to do that because the branching granularity gap with Fermi really starts to get wide.

Fermi did not increase the wavefront size, but it decreased TEX throughput per SM. That's why it can reduce the number of registers per SM without hurting latency hiding.
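The latency-hiding arithmetic behind this trade can be sketched roughly as follows (every number here is illustrative, not a measured GPU figure):

```python
# Rough latency-hiding arithmetic: to keep the TEX units busy across
# their latency, an SM needs (latency x fetch issue rate) fetches in
# flight, i.e. that many threads if each has one outstanding fetch.
# All figures below are illustrative, not real GPU numbers.
def threads_needed(tex_latency_cycles, tex_fetches_per_clock):
    return tex_latency_cycles * tex_fetches_per_clock

def registers_needed(tex_latency_cycles, tex_fetches_per_clock,
                     regs_per_thread):
    return threads_needed(tex_latency_cycles,
                          tex_fetches_per_clock) * regs_per_thread

# Halving TEX throughput per SM halves the register budget required
# for the same latency cover -- the trade described above.
full = registers_needed(400, 4, 20)   # hypothetical 4 fetches/clock
half = registers_needed(400, 2, 20)   # hypothetical 2 fetches/clock
print(full, half)                     # full is exactly twice half
```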
     
  3. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
I still say that when you multiply pixel count by 10 and geometry count by a few hundred, it's going to skew the distribution towards a greater percentage of small triangles, not a lower one. It's not like artists only used more polys on world geometry and kept the same low-poly enemies.
Not when we're talking about geometry bottlenecks, because triangles come in clumps of similar size (often zero-sized), so you can't buffer out this inefficiency. You can't process those millions of small triangles while the rasterizer and shading engines work on the large ones, because the workload just doesn't arrive in such a neatly interleaved fashion.

So if you want to know for how many cycles you will be limited by geometry processing or setup, the number of small triangles is the important factor.
     
  4. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
Regarding the GTX 400's pixel throughput, the puzzle finally seems solved. Over the Easter holiday I had a rather lengthy email conversation with Nvidia Tech PR and some PMs with Damien, who kindly shared the results he was getting.

The bottleneck, as it seems right now, is the connection between the shader engines and the ROPs, which is tailored to accommodate 32 pixels of 32 bits at a time. Formats like RGB9E5 or RGB10A2 and the like take up as many slots as fully blown FP16 pixels, thus being half rate here as well. I can only guess at what the connection itself will be, but it seems like it can (for each pixel of theoretical throughput) operate on four lanes of 8 bits at a time. If the channels exceed 8 bits, as in RGB9E5, it takes double the time, either serialized or with paired lanes (and then serialized 2 by 2). More than 16 bits and we go to four cycles / groups of four lanes.

Single-channel FP32 pixels seem to occupy only a single time slot across all four lanes, so maybe this is the base unit, and it can really be split two- and four-ways.

I wrote about it at somewhat greater length (and in German) here: http://www.pcgameshardware.de/aid,7...-Fuellraten-Raetsel-geloest/Grafikkarte/Test/
But the gist is as I said above. Hopefully Damien will follow soon with his update, and maybe he can do some tests with two FP16 channels…
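The lane model described in this post can be written down as a small calculation. This is my reading of it, not a confirmed spec; the lane-pairing rule for formats with fewer than four channels is an assumption:

```python
# Hypothetical model of the GF100 shader-engine -> ROP link: per pixel,
# four lanes of 8 bits per cycle. A channel wider than its lane share
# takes extra cycles; when a format has fewer than four channels, lanes
# gang up on one channel. An interpretation, not a confirmed spec.
def cycles_per_pixel(bits_per_channel, channels):
    lanes_per_channel = 4 // channels          # 4, 2 or 1 lanes per channel
    lane_bits = 8 * lanes_per_channel          # bits movable per cycle
    return -(-bits_per_channel // lane_bits)   # ceiling division

print(cycles_per_pixel(8, 4))    # RGBA8: 1 cycle, full rate
print(cycles_per_pixel(10, 4))   # RGB10A2: 2 cycles, half rate
print(cycles_per_pixel(16, 4))   # FP16 RGBA: 2 cycles, half rate
print(cycles_per_pixel(32, 4))   # FP32 RGBA: 4 cycles, quarter rate
print(cycles_per_pixel(32, 1))   # single-channel FP32: 1 cycle, full rate
```

Under this reading, two FP16 channels would pack into one slot (full rate), while two FP32 channels would take two cycles; that is exactly the case the proposed FP16x2/FP32x2 tests could confirm or refute.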
     
  5. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    Interesting. I always thought format conversion happens in ROPs but that suggests there's some logic for that somewhere at the end of the pixel shader pipe.
It wouldn't explain the slow 4-channel fp32 blend result, but maybe this is really something like 1/4 full speed per channel for the blender, so 1/16 of the nominal (34 GPix/s) ROP rate. At least that would fit all the fp32 blend data for both GTX285 and GTX480...
    A bit strange that you'd have 48 rops but for most things they aren't any better than 32, though maybe they are very cheap anyway.
Maybe it's time to spend those transistors on getting color data back to the shader clusters instead, doing blending and writing back through some generic memory controller (which would still handle color compression); that rop design sounds kinda lame. Well, for color at least.
     
  6. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    The way Carsten described the issue it sounded like a bandwidth problem.
     
  7. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    Sure yes. Still lame design.
     
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Well it's obvious why they aren't better than 32 for most things but I'm trying to understand if they're better than 32 for anything :???:
     
  9. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
Well, if fp32 blend is really quarter-speed per channel, that would at least be good for a somewhat faster fp32 blend rate... Of course that would just be 48 incredibly slow fp32 blend units vs. 32 incredibly slow fp32 blend units, but at least that's something...
Also the increased z fill numbers - they are still way below what should be possible, as far as I understand, but maybe the big increase there compared to GTX285 is also (at least partly) due to that.

    edit: so here's what I think these chips can do with blending per rop, if they had enough bandwidth:
    Cypress (probably all Evergreen):
    - full rate int8, fp10
    - half rate fp16
    - quarter rate fp32
Except for fp10, those would be the same as rv7xx

    Fermi:
    - full rate int8
    - half rate fp16, fp10
    - quarter rate (per channel! hence 1/16 for 4 channel) fp32
    Those would all be the same as gt200
    All numbers except the fp32 ones would be limited by that 32 (at 32bit) pixel shader->rop connection, hence be the same for 32 or 48 rops
    (btw does that 32 pixel number go down with cluster count? If that's just a bandwidth limitation it shouldn't right since other clusters can just send pixels down more often?)

    Without blending it would be:
    Cypress:
    - full rate int8, fp10, fp16
    - half rate fp32
    Again same as rv7xx except fp10

    Fermi:
    - full rate int8
    - half rate fp16, fp10
    - one fp32 result per clock, that is full rate single-channel fp32, quarter rate 4-channel fp32
    All numbers (without exception) limited by the 32 pixel (at 32bit) shader->rop connection, hence the same for 32 or 48 rops

    That theory is on shaky ground... I won't even touch z fillrate here...
    Some 2-channel fp32 numbers, please...
     
    #3949 mczak, Apr 8, 2010
    Last edited by a moderator: Apr 8, 2010
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
That theory makes sense, but at a high level the measured performance seems to track shader throughput more closely than anything else. This is based on Damien's numbers:

[image: Damien's measurements, not preserved]
     
  11. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    Right. So it's still a mystery...
btw we're always talking about a 32 pixel (per clock) rasterization limit. But that apparently doesn't affect z fill rate (for either Cypress or GF100), so is it only true for pixels actually going to the shader core? I'm wondering what this actually measures...
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
I don't really understand what's going on with z-fillrate in general. Can someone explain how we get high z-fillrates even when AA is not enabled? Are rasterizers actually capable of producing more depth samples than color samples per clock?
     
  13. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    It's been that way since the NV3x, at least if you switch off color fill.
     
  14. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    Yes. I've touched on this in the Cypress architecture piece. There are a number of simplifications that are possible in that area when doing Z-only rendering, and both IHVs leverage this.
     
  15. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Yeah it was a silly question to begin with. Just realized the rasterizer has to be able to generate that many depth samples for AA anyway.
     
  16. FenderBender

    Newcomer

    Joined:
    Aug 10, 2007
    Messages:
    45
    Likes Received:
    0
    New subtopic:

How much work would it be for a Fermi follow-on to add preemptive multitasking support?

    The hardware already has multiple simultaneous kernel execution support. It already has a cache and registers can spill to that cache. It already has a powerful scheduler. It even has the luxury of descheduling a running kernel block without any runtime speed loss as other blocks are already using the resources, so there's no context switch penalty.

So as I see it, the main feature needed for preemptive multitasking is the ability to save and restore context, meaning program counters for each warp, predicate masks for each warp, registers, and shared memory. That's a pretty big context, but remember that unlike a CPU, the hardware doesn't need to stall while saving or restoring this context... it just needs to be descheduled, and then the bundle of context saved to device memory. (The word "just" here may hide a big job, though.)
In fact, with the uniform 64-bit memory space, even the biggest part of the context, the shared memory, may not need to be explicitly saved at all... it could just be pushed out and restored lazily by treating it exactly like dirty L1 cache lines... the shared memory hardware IS the L1 cache, so this functionality is already there.

So, am I missing something else that would be needed to allow dynamic task switching between kernels? I'm thinking of all kinds of obvious applications, like letting a physics kernel give way to a graphics kernel and only use GPU throughput when it's otherwise idle. Or, for that matter, running CUDA jobs while still using the GPU for the graphics display.

You could also imagine it being useful for having lots of background tasks idling away, waiting for work. Your particle system code would always be running, waking up only occasionally when there's data to process... if there wasn't any, it'd announce that it was going back to sleep for a while until a timer (or explicit event trigger) reschedules it. Note that this means the CPU is not involved at all!
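A back-of-the-envelope sizing of that context, using GF100's published per-SM figures (32K 32-bit registers, up to 48 KB shared memory, 48 resident warps); the per-warp state sizes are my guesses:

```python
# Rough size of the context to save per SM on GF100: the register file
# (32768 x 32-bit), shared memory (up to 48 KiB), plus per-warp program
# counter and predicate/divergence state (those sizes are guesses).
REG_FILE_BYTES = 32768 * 4        # 128 KiB register file per SM
SHARED_BYTES   = 48 * 1024        # 48 KiB shared memory per SM
WARPS_PER_SM   = 48               # max resident warps on GF100
WARP_STATE     = 8 + 4 + 4        # guessed: PC + active mask + stack ptr

context_per_sm = REG_FILE_BYTES + SHARED_BYTES + WARPS_PER_SM * WARP_STATE
chip_context   = 15 * context_per_sm          # GTX 480: 15 SMs enabled

# Time to stream the whole thing out once at ~177 GB/s device bandwidth:
save_us = chip_context / 177e9 * 1e6
print(context_per_sm, chip_context, round(save_us, 1))
```

So a full save (or restore) is a few megabytes and on the order of 15 microseconds of pure bandwidth per direction: nontrivial, but not catastrophic, which is roughly the size of the job hiding behind that "just".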
     
  17. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
From my understanding of what I've been told, multiple concurrent kernels on GF100 only refers to kernels of the same type, i.e. different physics solvers for cloth, fluid and so on. They all have to belong, however, to the same operational mode/context, i.e. CUDA or graphics. Not sure about DX compute shaders though.

Currently, heavy use of CUDA kernels will still bring the Windows GUI (Win7, Aero G.) to a crawl.
     
    #3957 CarstenS, Apr 11, 2010
    Last edited by a moderator: Apr 11, 2010
  18. FenderBender

    Newcomer

    Joined:
    Aug 10, 2007
    Messages:
    45
    Likes Received:
    0
No, in CUDA it's actually arbitrary kernels... any SM could be running up to 4 different kernels at the same time. What it CAN'T do is suspend one of those kernels, swap out its state, swap in state for a different kernel, then resume by re-swapping.

    You're right that the GPU is in CUDA or graphics mode though. But if you could context switch, the CUDA kernels could be swapped out, the graphics run to paint the frame, then CUDA kernels swapped back in.

This switching ability isn't in Fermi now (at least nobody has even hinted at it), but my question is mostly about how hard it'd be to add, since the hardware can already do most of the substeps of context switching.
     
  19. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,806
    Likes Received:
    473
They could have the driver put extra conditional jumps in the shaders for cooperative multitasking (i.e. if the task-switching flag is set, save all the context and end the kernel).
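A toy illustration of that cooperative scheme, with plain Python standing in for the driver-rewritten shader; all names are hypothetical:

```python
# Toy model of driver-inserted cooperative multitasking: each loop
# iteration polls a task-switch flag; when set, the block saves its own
# context to memory and ends the kernel, to be relaunched later from
# the saved state. Python stand-in for a rewritten shader, not real code.
def instrumented_kernel(state, switch_requested, saved):
    while state["i"] < state["n"]:
        if switch_requested():        # driver-inserted conditional jump
            saved.update(state)       # save all the context...
            return "yielded"          # ...and end the kernel
        state["acc"] += state["i"]    # original kernel body: sum 0..n-1
        state["i"] += 1
    return "done"

# Preempt after five iterations, then relaunch from the saved context.
ctx = {"i": 0, "n": 10, "acc": 0}
saved = {}
first = instrumented_kernel(ctx, lambda: ctx["i"] == 5, saved)
second = instrumented_kernel(saved, lambda: False, {})
print(first, second, saved["acc"])   # yielded done 45
```

The cost of the scheme shows up in the per-iteration flag check and in the fact that yield points only exist where the driver inserted them, so worst-case switch latency is one trip through the longest uninstrumented stretch.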
     
  20. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
That was what I meant by "multiple concurrent kernels on GF100 only refers to kernels of the same type, i.e. different physics solvers for cloth, fluid and so on. They all have to belong, however, to the same operational mode/context, i.e. CUDA or graphics".

Right - it doesn't seem to be available in hardware, otherwise Nvidia would have boasted about it too. AFAIK they only went on about having reduced context switching for the whole chip to 20 microseconds, which is supposed to be a couple of times faster than with previous GeForce cards.
     