Nvidia Pascal Announcement

Discussion in 'Architecture and Products' started by huebie, Apr 5, 2016.

Tags:
  1. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    288
    Likes Received:
    189
    Maybe it's me, but that conflicts directly with your other statement. It looks like in practice it's 960 vs 1920 ALUs, so the same instead of 2/3rds. And in fact, it looks like registers are the limiting factor in both cases.

    And my reasoning is that with a double, you can hold half as many operands. Even if you only have half the ALUs to feed, I'm pretty sure trips to memory are going to be more frequent, and each fetch again brings back only half as many operands, decreasing the probability that whatever was copied along will be needed in the immediate future, thus requiring another access sooner than with single precision. I'm not talking about massive increases, but I'm pretty sure the bandwidth requirements are higher.
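    To put rough numbers on that (a back-of-the-envelope sketch; the 64K x 32-bit register file per SM is an assumption in line with Kepler/Maxwell, not a quoted spec):

    ```cuda
    #include <stdio.h>

    int main(void) {
        /* Assumption: 65536 32-bit registers per SM, Kepler/Maxwell-class. */
        const unsigned regfile_bytes = 65536u * 4u;        /* 256 KiB */
        const unsigned fp32_resident = regfile_bytes / 4;  /* 65536   */
        const unsigned fp64_resident = regfile_bytes / 8;  /* 32768   */
        printf("FP32 operands resident per SM: %u\n", fp32_resident);
        printf("FP64 operands resident per SM: %u\n", fp64_resident);
        /* Half the resident working set: a kernel starts spilling to
           memory at half the problem size it tolerates in single
           precision, so the extra traffic shows up earlier. */
        return 0;
    }
    ```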
     
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,943
    Likes Received:
    2,286
    Location:
    Germany
    Where can one see „some presentation“? I have yet to see a presentation with as much detail as instruction issue rates wrt Pascal.
     
    Razor1 likes this.
  3. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,943
    Likes Received:
    2,286
    Location:
    Germany
    What's more: DPFP worst-case power still has headroom compared to SPFP, where with re-used operands more ALUs could be utilized. Thus the potential power draw in SP mode is higher, and it makes no sense to limit frequencies for DP mode only. That's my train of thought in a nutshell.

    So you're referring to memory accesses? As soon as trips to memory are occurring in order to relieve register-file limitations, chances are you are running nowhere near your occupancy limit, and thus nowhere near your power limit.
     
  4. Nakai

    Newcomer

    Joined:
    Nov 30, 2006
    Messages:
    46
    Likes Received:
    10
    [image: slide hosted on computerbase]

    I hope deep linking on computerbase is allowed, but this is the slide I was referring to.

    So how does that fit together?
     
    Kej, gamervivek, Razor1 and 1 other person like this.
  5. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,506
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    Each scheduler in the SM has 16xFP64 lanes attached to it, so a single warp would need two cycles to complete. Maybe that's the meaning of the issue rate for DP. :???:
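    If that is the meaning, the arithmetic works out neatly (a sketch; the 16-lane-per-scheduler figure is the assumption under discussion, not a confirmed spec):

    ```cuda
    #include <stdio.h>

    int main(void) {
        const int warp_width = 32;  /* fixed warp size */
        const int fp64_lanes = 16;  /* assumed FP64 lanes per scheduler */
        printf("cycles per DP warp instruction: %d\n",
               warp_width / fp64_lanes);                  /* -> 2 */
        /* One DP warp instruction per scheduler every 2 clocks is what
           a spec table would report as a 1/2 issue rate. */
        return 0;
    }
    ```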
     
    trinibwoy, CSI PC and Nakai like this.
  6. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,943
    Likes Received:
    2,286
    Location:
    Germany
    What you have to take into consideration is the thread group/warp size of 32 that stays the same between SP and DP as well as the difference between latency and throughput.
     
    Nakai likes this.
  7. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Perhaps issue rate is being used as a proxy (or physical layout is being used as a proxy). We've been talking about the fp32 units being virtually split for fp16 math, but one would not claim that there are twice the fp16 units running at twice the rate, or 4x total throughput. It's possible that the issue rate is half rate, but that's the issue rate of a full-width set of ops, and not the half-width native to the fp64 hardware?

    [heh, three people said at the same time]

    Maybe a more interesting question -- where can the pair of fp16 ops come from? Different warps? Only consecutive operations in the same warp?
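    For what it's worth, the way CUDA exposes it hints at one answer: the pair sits packed inside a single thread's 32-bit register as a half2. A minimal sketch (kernel name and launch setup are mine; needs an sm_60-class target):

    ```cuda
    #include <cuda_fp16.h>

    // Each thread issues one packed instruction (HADD2-style) that
    // produces two fp16 results -- the pair comes from the same thread,
    // not from different warps.
    __global__ void hadd2_sketch(const __half2* a, const __half2* b,
                                 __half2* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = __hadd2(a[i], b[i]);  // single issue, two fp16 adds
    }
    ```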
     
    ieldra and Nakai like this.
  8. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    288
    Likes Received:
    189
    Not so sure regarding the power limit. Nvidia's focus on data locality over the years as a means of increasing efficiency, and the obvious results over those same years, IMO speak for themselves. Considering the results, I don't think all of that talk from Bill Dally and company was just, well, talk.
     
    nnunn likes this.
  9. Nakai

    Newcomer

    Joined:
    Nov 30, 2006
    Messages:
    46
    Likes Received:
    10
    That's very plausible. So there are no 2x8xFP64 units per scheduler, but one 16xFP64? Then this makes some sense.
     
  10. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    387
    Likes Received:
    394
    Pretty sure that's a VLIW2 op. So that doubled FP16 rate is more of a theoretical maximum.

    Would be interesting to know if memory access needs to be aligned (same / packed operands).
     
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,943
    Likes Received:
    2,286
    Location:
    Germany
    Of course it is much more expensive energy-wise to have data fetched from system or local off-chip memory, no doubt about it. Especially when you ignore the hundreds-of-cycles latency penalty you pay for this and pretend that your ALUs are running at full throttle in the meantime anyway. The reality, though, is that whenever you cannot feed your ALUs from the register file, nor from the L1 cache, nor from the L2 cache, but have to take a trip down memory lane (wohoo, dat joke!), your ALUs will most likely run dry and not consume their full energy share.
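    A quick Little's-law sketch of the "running dry" point (the latency figure is an illustrative assumption, not a measured GK110 number):

    ```cuda
    #include <stdio.h>

    int main(void) {
        const int dram_latency  = 400; /* assumed round-trip, in cycles */
        const int issue_per_clk = 1;   /* warp instr. per scheduler clock */
        /* Independent warp instructions that must be in flight so the
           scheduler keeps issuing while one warp waits on DRAM: */
        printf("in-flight instructions needed: %d\n",
               dram_latency * issue_per_clk);
        return 0;
    }
    ```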

    I think data locality matters more when talking about large installations with thousands of GPUs, which is not where the Titan was primarily meant to live.
     
  12. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,506
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    I'm not sure about the exact configuration (one 16-lane unit or two 8-lane units), and the block diagram isn't necessarily an authentic source for that, but most probably each scheduler matches a single 16-lane ALU, for simplicity's sake.
     
  13. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,314
    Likes Received:
    140
    Location:
    On the path to wisdom
    The simplest way to think about FP16 instructions is that they really are instructions using 32-bit wide registers, just like any FP32 and INT32 operations.
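    That view matches the intrinsics CUDA ships: the two halves are explicitly packed into, and pulled out of, one 32-bit value (a device-side sketch; the helper name is mine):

    ```cuda
    #include <cuda_fp16.h>

    // One 32-bit register holds both fp16 values; they unpack and
    // convert like any other operand.
    __device__ float2 unpack_halves(__half2 packed)
    {
        __half lo = __low2half(packed);
        __half hi = __high2half(packed);
        return make_float2(__half2float(lo), __half2float(hi));
    }
    ```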
     
  14. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,943
    Likes Received:
    2,286
    Location:
    Germany
    IIRC, Nvidia changed this from half-warp-feeding in Fermi (due to different clock domains) to single-cycle with Kepler and Maxwell for power reasons. So we'd actually have one 32-wide-group of SPFP-ALUs and one 16-wide-group of DPFP-ALUs in a Pascal SM(P).

    FWIW, in GK110, it was IMHO a similar setup with four groups of 16-wide DPFP-ALUs, each attached to one of the four warp schedulers.
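    In numbers, for a Pascal SM(P) with two schedulers laid out as above (the lane counts are my reading of the diagrams, not vendor-confirmed):

    ```cuda
    #include <stdio.h>

    int main(void) {
        const int schedulers = 2;  /* per GP100 SM(P), assumed */
        const int sp_lanes   = 32; /* SPFP ALUs per scheduler, assumed */
        const int dp_lanes   = 16; /* DPFP ALUs per scheduler, assumed */
        printf("SP lanes/SM: %d, DP lanes/SM: %d -> DP at 1/%d SP rate\n",
               schedulers * sp_lanes, schedulers * dp_lanes,
               sp_lanes / dp_lanes);
        return 0;
    }
    ```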
     
  15. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    I don't think there's going to be any major difference at all. Most DRAM timings are specified in ns. That's not going to be different for HBM.
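    To illustrate: a latency fixed in ns costs more cycles the faster the clock, regardless of the DRAM flavor (a sketch; 14 ns is a typical CAS-latency ballpark, not an HBM datasheet value):

    ```cuda
    #include <stdio.h>

    int main(void) {
        const double t_ns = 14.0;  /* assumed absolute access latency */
        const double clocks_mhz[] = { 500.0, 1000.0, 1753.0 };
        for (int i = 0; i < 3; ++i)
            printf("%6.0f MHz -> %5.1f cycles\n", clocks_mhz[i],
                   t_ns * clocks_mhz[i] * 1e-3);
        return 0;
    }
    ```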
     
  16. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    288
    Likes Received:
    189
    It's not large installations that Nvidia papers talk about.

    I believe you're assuming way too much here.

    And I'm not talking about VRAM access only. L2 consumes significantly more than L1, and L1 significantly more than the RF. The power cost increases at every level, and it's often orders of magnitude between each level that we're talking about. By contrast, what I'm suggesting is only a 10% increase in resulting overall power. I don't think it's far-fetched at all.
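    For scale, the kind of per-access figures Dally's talks throw around (illustrative ballpark numbers I'm plugging in, not vendor-measured):

    ```cuda
    #include <stdio.h>

    int main(void) {
        /* Assumed pJ-per-access figures in the spirit of Dally's talks. */
        const double fma_pj  = 20.0;    /* 64-bit FMA           */
        const double rf_pj   = 5.0;     /* register-file fetch  */
        const double l2_pj   = 100.0;   /* on-chip cache access */
        const double dram_pj = 16000.0; /* off-chip DRAM access */
        printf("RF -> L2:    %5.0fx\n", l2_pj / rf_pj);
        printf("L2 -> DRAM:  %5.0fx\n", dram_pj / l2_pj);
        printf("DRAM vs FMA: %5.0fx\n", dram_pj / fma_pj);
        return 0;
    }
    ```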

    Nvidia swears data locality is the key to power efficiency and is basing past, present and future GPU designs around that concept. You think data locality doesn't matter. Excuse me if I take the opinion of a company that's betting its future on that concept more seriously than your opinion. Nothing personal. I generally believe in Occam's Razor, so a theory that basically requires everything Nvidia said to be either incorrect or a blatant lie is not very attractive to me.
     
  17. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,943
    Likes Received:
    2,286
    Location:
    Germany
    I don't say it doesn't matter. Far from it. I just think it does not matter much in determining why the GK110-based Titan is not allowed to boost while in DP mode and sports a lower base clock there.

    --
    To expand on that a little: I do not assume much at all. Whenever you cannot feed your ALUs out of your register files, they are not running at full throttle - but apart from electrical power, that also means you cannot get to your peak performance. If this happens in very simple tasks already, chances are that you have to overbuild your installation massively in order to reach your performance targets (i.e. n PFLOPS). That's why I draw a line between single-card usage and multiple-thousand-unit installations, for which the Titan is not intended - contrary to the corresponding Tesla cards, which have much lower clocks than the Titan in the first place.

    Mostly, Dally's talks revolve around the exascale machine planned for a couple of generations out. That qualifies as a large installation in my book.
     
    #157 CarstenS, Apr 6, 2016
    Last edited: Apr 6, 2016
  18. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,506
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    Throttling GK110 in DP mode probably comes as a precautionary measure against sudden power surges due to the more "dense" load the FP64 ALU array puts on the device.
    Intel has taken a similar measure for its 18-core Haswell-EP, which lowers the Turbo clock when AVX code starts executing.
     
  19. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    288
    Likes Received:
    189
    OK. So to narrow it down, because I'm still confused about your position: either you don't think locality has a great effect (from your last post, this does not seem likely), or you don't think DP would significantly decrease locality (to the point of increasing power by 10%, that is). I'd bet it's the latter, for obvious reasons. But I can't see how it wouldn't. There are half as many operands to choose from. And I don't agree that ALUs are going to sit around doing nothing while a higher-level memory access is in flight; surely they'll find another thread/warp with a higher level of residency to work on...
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,601
    Likes Received:
    643
    Location:
    New York
    Yep, it's per warp. Otherwise they would count it as 16 instructions per clock.
     