Nvidia Pascal Announcement

How can it? It's 64 bits per operand instead of 32, but with only 960 instead of 2880 ALUs, so 2/3rds.

Maybe it's me, but that conflicts directly with your other statement. It looks like in practice it's 960 vs 1920 ALUs, so the same instead of 2/3rds. And in fact, it looks like registers are the limiting factor in both cases.

And my reasoning is that with a double, you can hold half as many operands. Even if you only have half the ALUs to feed, I'm pretty sure that trips to memory are going to be more frequent: you'd again fetch only half the number of operands per transfer, decreasing the probability that what was brought along is needed in the immediate future, and thus requiring another access sooner than you would with single precision. I'm not talking about massive increases, but I'm pretty sure the bandwidth requirements are higher.
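To make that concrete, here is a minimal, purely illustrative CUDA sketch (the kernel and names are hypothetical, not from any NV material): the same axpy-style kernel instantiated for float and for double. Each double occupies a 64-bit register pair, so the same per-thread register budget holds half as many live operands, and every operand that has to come from memory is twice as wide.

```
// Illustrative sketch only (hypothetical names, not vendor code).
// A double occupies a 64-bit register pair, so the same register budget holds
// half as many live values; each operand also costs 8 B instead of 4 B of
// bandwidth whenever it has to be fetched from memory.
template <typename T>
__global__ void scale_add(const T* __restrict__ x, T* __restrict__ y, T a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // float: 4 B per operand, double: 8 B per operand
}

// Explicit instantiations so ptxas reports per-thread register usage for both
// variants when compiled with: nvcc -arch=sm_60 --ptxas-options=-v
template __global__ void scale_add<float>(const float*, float*, float, int);
template __global__ void scale_add<double>(const double*, double*, double, int);
```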
 
I really don't get how NV did FP64 on Pascal.

According to some presentation, Pascal can issue a pair of fp16 ops per clock, an fp32 op per clock and an fp64 op every two clocks.
The devblog still states that there are dedicated fp64 units on Pascal.

If Pascal has only half the FP64 units, and can only issue an fp64 op every two clocks, does this lead to a 1:4 ratio rather than a 1:2 ratio?
Can somebody explain this?
Where can one see "some presentation"? I have yet to see a presentation with as much detail as instruction issue rates wrt Pascal.
 
Maybe it's me, but that conflicts directly with your other statement. It looks like in practice it's 960 vs 1920 ALUs, so the same instead of 2/3rds. And in fact, it looks like registers are the limiting factor in both cases.
Whatever DPFP's worst-case power draw is, it still has headroom relative to SPFP, where with re-used operands more ALUs could be utilized. Thus the potential power draw in SP mode is higher, and it makes no sense to limit frequencies for DP mode only. That's my train of thought in a nutshell.

And my reasoning is that with a double, you can hold half as many operands. Even if you only have half the ALUs to feed, I'm pretty sure that trips to memory are going to be more frequent: you'd again fetch only half the number of operands per transfer, decreasing the probability that what was brought along is needed in the immediate future, and thus requiring another access sooner than you would with single precision. I'm not talking about massive increases, but I'm pretty sure the bandwidth requirements are higher.
So you're referring to memory accesses? As soon as trips to memory are occurring in order to relieve register file pressure, chances are you are nowhere near full occupancy, and thus nowhere near the power limit.
 
Where can one see "some presentation"? I have yet to see a presentation with as much detail as instruction issue rates wrt Pascal.

[attached slide image: 9-1080.3618039559.jpg]


I hope deep linking on computerbase is allowed, but this is the slide I was referring to.

So how does that add up?
 
What you have to take into consideration is the thread group/warp size of 32 that stays the same between SP and DP as well as the difference between latency and throughput.
 
..issue rate..

Perhaps issue rate is being used as a proxy (or physical layout is being used as a proxy). We've been talking about the fp32 units being virtually split for fp16 math, but one would not claim that there are twice the fp16 units running at twice the rate, or 4x total throughput. It's possible that the issue rate is half rate, but that's the issue rate of a full-width set of ops, and not the half-width native to the fp64 hardware?

[heh, three people said at the same time]

Maybe a more interesting question -- where can the pair of fp16 ops come from? Different warps? Only consecutive operations in the same warp?
 
So you're referring to memory accesses? As soon as trips to memory are occurring in order to relieve register file pressure, chances are you are nowhere near full occupancy, and thus nowhere near the power limit.

Not so sure regarding the power limit. Nvidia's focus on data locality over the years as a means to increase efficiency, and the obvious results over those same years, IMO speak for themselves. Considering the results, I don't think all of that Bill Dally and company talk was just, well, talk.
 
Each scheduler in the SM has 16xFP64 lanes attached to it, so a single warp would need two cycles to complete. Maybe that's the meaning of the issue rate for DP. :???:

That's very plausible. So there are no 2x8xFP64 units per scheduler, but 16xFP64? Then this makes some sense.
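A quick back-of-envelope, assuming the 16 FP64 lanes per scheduler described above and twice that many FP32 lanes alongside them (an assumption consistent with the advertised 1:2 rate):

\[
\text{clocks per 32-thread FP64 warp} = \frac{32\ \text{threads}}{16\ \text{lanes}} = 2,
\qquad
\frac{\text{FP64 throughput}}{\text{FP32 throughput}} = \frac{16\ \text{lanes}}{32\ \text{lanes}} = \frac{1}{2}.
\]

So "one FP64 instruction every two clocks" and "half as many FP64 units" are two descriptions of the same 1:2 ratio, not two factors that multiply into 1:4.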
 
Maybe a more interesting question -- where can the pair of fp16 ops come from? Different warps? Only consecutive operations in the same warp?
Pretty sure that's a VLIW2 op. So that doubled FP16 rate is more of a theoretical maximum.

Would be interesting to know if memory access needs to be aligned (same / packed operands).
 
Of course it is much more expensive energy-wise to have data fetched from system or local off-chip memory, no doubt about it. Especially when you ignore the hundreds-of-cycles latency penalty you pay for this and pretend that your ALUs are running full throttle in the meantime anyway. Reality, though, is that it is much more likely that whenever you cannot feed your ALUs from the register file, nor from L1 cache, nor from L2 cache, but have to take a trip down memory lane (wohoo, dat joke!), your ALUs will run dry and not consume their full energy share.

I think data locality matters more when talking about large installations with thousands of GPUs, which is not the environment the Titan was primarily meant to live in.
 
So there are no 2x8xFP64 units per scheduler, but 16xFP64?
I'm not sure about the exact configuration (one 16-lane unit or two 8-lane), and the block diagram isn't necessarily an authentic source for that, but most probably each scheduler matches a single 16-lane ALU, for simplicity's sake.
 
Maybe a more interesting question -- where can the pair of fp16 ops come from? Different warps? Only consecutive operations in the same warp?
The simplest way to think about FP16 instructions is that they really are instructions using 32-bit wide registers, just like any FP32 and INT32 operations.
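A minimal sketch of what that looks like in practice (hypothetical kernel name, and it assumes an architecture with the packed FP16 intrinsics, i.e. sm_53 or later): the two FP16 values sit packed in one 32-bit register of the same thread, and a single packed instruction operates on both halves.

```
#include <cuda_fp16.h>

// Illustrative only: one __half2 is a single 32-bit register holding two FP16
// values, so one __hfma2 performs two FP16 FMAs per thread per instruction.
__global__ void axpy_half2(const __half2* x, __half2* y, __half2 a, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a, x[i], y[i]);   // both halves updated by one packed op
}
```

Which would also answer the earlier question: the pair comes from the same thread in the same warp, packed into one register, so the doubled FP16 rate only materializes when the data can actually be packed that way.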
 
I'm not sure about the exact configuration (one 16-lane unit or two 8-lane), and the block diagram isn't necessarily an authentic source for that.
IIRC, Nvidia changed this from half-warp-feeding in Fermi (due to different clock domains) to single-cycle with Kepler and Maxwell for power reasons. So we'd actually have one 32-wide-group of SPFP-ALUs and one 16-wide-group of DPFP-ALUs in a Pascal SM(P).

FWIW, in GK110, it was IMHO a similar setup with four groups of 16-wide DPFP-ALUs, each attached to one of the four warp schedulers.
 
Do we know how HBM2 latency compares to GDDR5? I guess at 1.4GHz it must be quite high in absolute terms, while the increase in GPU clock speed also increases the relative latency in clock cycles.
I don't think there's going to be any major difference at all. Most DRAM timings are specified in ns. That's not going to be different for HBM.
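As a purely illustrative conversion (the 50 ns figure is assumed, not a spec): with DRAM timings fixed in nanoseconds, the cost in core cycles simply scales with the core clock,

\[
t_{\text{cycles}} = t_{\text{ns}} \times f_{\text{core}}\,[\text{GHz}],
\qquad
50\,\text{ns} \times 1.0\,\text{GHz} = 50\ \text{cycles},
\qquad
50\,\text{ns} \times 1.5\,\text{GHz} = 75\ \text{cycles},
\]

which is the sense in which a higher GPU clock raises the relative latency even if the DRAM itself doesn't change.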
 
I think data locality matters more when talking about large installations with thousands of GPUs, which is not the environment the Titan was primarily meant to live in.

It's not large installations that Nvidia papers talk about.

your ALUs will run dry and not consume their full energy share

I believe you're assuming way too much here.

And I'm not talking about VRAM access only. L2 consumes significantly more than L1, and L1 significantly more than the RF. The energy cost increases at every level, and it's often orders of magnitude between each level. By contrast, we're only talking about a 10% increase in resulting, overall power. I don't think it's far-fetched at all.

Nvidia swears data locality is the key to power efficiency and is basing past, present and future GPU designs around that concept. You think data locality doesn't matter. Excuse me if I take the opinion of a company that's betting its future on that concept more seriously than your opinion. Nothing personal. I generally believe in Occam's Razor, so a theory that basically requires everything Nvidia said to be either incorrect or a blatant lie is not very attractive to me.
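A back-of-envelope with assumed, round numbers (illustrative only, nothing here is a measured figure) shows how per-access costs that differ by orders of magnitude can still show up as an effect of roughly 10% on total board power:

\[
P_{\text{DRAM I/O}} \approx 300\,\tfrac{\text{GB}}{\text{s}} \times 8\,\tfrac{\text{bit}}{\text{B}} \times 20\,\tfrac{\text{pJ}}{\text{bit}} \approx 48\ \text{W}.
\]

On a ~250 W board that is already around 20% of the budget, so shifting even a modest fraction of operand traffic from the register file (picojoules per access) out to DRAM (nanojoules per access) lands in the tens-of-watts range being argued about here.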
 
I'm not saying it doesn't matter. Far from it. I just think it does not matter much in determining why the GK110-based Titan is not allowed to boost while in DP mode and sports a lower base clock there.

--
To expand on that a little: I do not assume much at all. Whenever you cannot feed your ALUs out of your register files, they are not running at full throttle - but apart from electrical power, that also means you cannot get to your peak performance. If this happens in very simple tasks already, chances are that you have to overbuild your installation massively in order to reach your performance targets (i.e. n PFLOPS). That's why I draw a line between single-card usage and multiple-thousand-unit large installations, for which the Titan is not intended - contrary to the corresponding Tesla cards, which have much lower clocks than the Titan in the first place.

It's not large installations that Nvidia papers talk about.
Mostly, Dally's talks revolve around the exascale machines planned a couple of generations out. That qualifies as a large installation in my book.
 
I just think it does not matter much in determining why the GK110-based Titan is not allowed to boost while in DP mode and sports a lower base clock there.
Throttling GK110 in DP mode probably comes as a precautionary measure against sudden power surges due to the more "dense" load the FP64 ALU array puts on the device.
Intel has taken a similar measure for their 18-core Haswell-EP, which lowers the Turbo clock when AVX code starts executing.
 
I'm not saying it doesn't matter. Far from it. I just think it does not matter much in determining why the GK110-based Titan is not allowed to boost while in DP mode and sports a lower base clock there.

OK, so to narrow it down, because I'm still confused about your position: either you don't think locality has a great effect (from your last post, that doesn't seem likely), or you don't think DP would significantly decrease locality (to the point of increasing power by 10%, that is). I'd bet it's the latter, for obvious reasons. But I can't see how it wouldn't. There are half as many operands to choose from. And I don't agree that ALUs are going to sit around doing nothing while a higher-level memory access is in flight; surely they'll find another thread/warp with a higher level of residency to work on...
 
Each scheduler in the SM has 16xFP64 lanes attached to it, so a single warp would need two cycles to complete. Maybe that's the meaning of the issue rate for DP. :???:

Yep, it's per warp. Otherwise they would count it as 16 instructions per clock.
 