I'm having trouble following your train of thought as well.

OK. So to narrow it down, because I'm still confused about your position: either you don't think locality has a great effect (from your last post, this does not seem likely), or you don't think DP would decrease locality significantly (to the point of increasing power by 10%, that is). I'd bet it's the latter, for obvious reasons. But I can't see how it wouldn't: there are half as many operands to choose from. And I don't agree that the ALUs are going to be sitting around doing nothing while a higher-level memory access is in flight; surely they'll find another thread/warp with a higher level of residency to work on...
It might just as well be the delicate timing involved in feeding the DP ALUs serially, which, as long as the hardware doesn't have 64-bit-wide data paths, takes twice as long as a regular (SP) operation.
The simplest way to think about FP16 instructions is that they really are instructions using 32-bit wide registers, just like any FP32 or INT32 operation.
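A minimal CUDA sketch of what that looks like in practice (assuming an sm_53+ GPU and the standard cuda_fp16.h intrinsics; the kernel name is just for illustration): two FP16 values are packed into one 32-bit register as a __half2, so the FP16 FMA below reads and writes full 32-bit registers, exactly the width an FP32 FMA would use.

#include <cuda_fp16.h>

// Each __half2 holds a pair of FP16 values packed into one 32-bit register.
__global__ void fp16_fma(const __half2* a, const __half2* b,
                         const __half2* c, __half2* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // One instruction performs two FP16 fused multiply-adds on 32-bit-wide operands.
        d[i] = __hfma2(a[i], b[i], c[i]);
    }
}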
I don't think the GP100, or rather the boards, even HAVE a DisplayPort, or any display output at all.
I'm having trouble following your train of thought as well.
To make things clear:
- I’m talking about the reason why Geforce GTX Titan is not allowed to boost in DP-mode and has a lower baseclock there as well
- I’m arguing that this is more of a precautionary measure than an actual problem with power consumption in DP mode
- I reason that this is because the chip has to feed only 1/3rd the number of ALUs compared to SP-Mode on twice as much/wide/many data paths
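To put rough numbers on that last point, here is a back-of-the-envelope sketch (my own figures, assuming GK110's 192 SPFP and 64 DPFP ALUs per SMX and an FMA reading three source operands per instruction):

#include <cstdio>

int main()
{
    const int sp_alus = 192, dp_alus = 64;   // per GK110 SMX
    const int operands_per_fma = 3;          // a, b, c
    const int sp_bytes = 4, dp_bytes = 8;    // FP32 vs FP64 operand size

    // Peak operand traffic per clock if every ALU issued an FMA.
    int sp_traffic = sp_alus * operands_per_fma * sp_bytes;   // 2304 bytes/clk
    int dp_traffic = dp_alus * operands_per_fma * dp_bytes;   // 1536 bytes/clk

    printf("SP: %d bytes/clk, DP: %d bytes/clk\n", sp_traffic, dp_traffic);
    // DP peak moves only 2/3 of the SP operand traffic: 1/3 of the ALUs
    // on operands that are twice as wide.
    return 0;
}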
Since 1/3rd of ALUs are idling most of the time, for all intents and purposes, on DP-mode the chip is feeding half the number of ALUs compared to SP-Mode on twice as much/wide/many data paths. So the amount of work done and data moved is "at the very least" the same in both SP and DP.
No, and that line alone is without the context I already gave multiple times, which you acknowledge first in the part below only to immediately discard it.

But do you see how your posts are conflicting? For example, you insist on this line:
True, the DPFP ALUs should be (a bit?) more power hungry than the SP ones. I'll throw another thing in the mix then: rasterizers, geometry, tessellators, ROPs. All mostly unused in what's a typical application with DPFP, I'd wager. Plus I am quite positive that Ld/St and SFUs cannot be tasked in parallel with the DPFP ALUs. Edit: This is probably not true, since the Kepler Whitepaper says: "Unlike Fermi, which did not permit double precision instructions to be paired with other instructions, Kepler GK110/210 allows double precision instructions to be paired with other instructions." (just not necessarily all other instructions). I don't know off the top of my head whether or not the INT functions might require larger-than-minimally-viable adders and multipliers, as well as data paths, from the SPFP ALUs, but I would tend to believe so.

But you've made it abundantly clear, and it's a point that has been made multiple times here on B3D by multiple posters, that Kepler can't actually use all of its ALUs most of the time. And the times in which it can, it's because it can do so without increasing the data being moved around. Hence Kepler's ALU utilization is limited entirely by how much data can be moved around. Since 1/3rd of ALUs are idling most of the time, for all intents and purposes, on DP-mode the chip is feeding half the number of ALUs compared to SP-Mode on twice as much/wide/many data paths. So the amount of work done and data moved is "at the very least" the same in both SP and DP. Now, correct me if I'm wrong, but DP ALUs consume a little more than 2x as much as SP ALUs; factor in a little bit of extra memory traffic due to decreased locality, and I don't see how that wouldn't result in slightly higher power consumption...
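In the same back-of-the-envelope style, here is what the reasoning in the last paragraph comes out to (again assuming 192 SPFP / 64 DPFP ALUs per GK110 SMX; the ~2/3 SP utilisation and the "a little more than 2x" per-op energy are the figures claimed in the posts here, not measurements):

#include <cstdio>

int main()
{
    const double sp_alus = 192, dp_alus = 64;
    const double sp_util = 2.0 / 3.0;                 // "can't feed more than 2/3rds"
    const double effective_sp = sp_alus * sp_util;    // 128 ALUs actually being fed

    // Operand traffic per clock, 3 source operands per FMA.
    const double sp_traffic = effective_sp * 3 * 4;   // FP32: 1536 bytes/clk
    const double dp_traffic = dp_alus * 3 * 8;        // FP64: 1536 bytes/clk

    // Per-op energy assumed ~2.2x for DP vs SP ("a little more than 2x").
    const double rel_alu_power = (dp_alus * 2.2) / effective_sp;   // ~1.1

    printf("traffic: SP %.0f vs DP %.0f bytes/clk, relative ALU power %.2f\n",
           sp_traffic, dp_traffic, rel_alu_power);
    return 0;
}

Under those assumptions the data moved is indeed the same and the ALU power lands roughly 10% higher in DP, which is the ~10% figure mentioned earlier in the thread.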
No, and that line alone is without the context I already gave multiple times, which you acknowledge first in the part below only to immediately discard it.
What's true is that Kepler is not able to feed more than 2/3rds of its SPFP-ALUs in the worst case. You, otoh, make this the default behaviour („for all intents and purposes ...“), which it is definitely not.
Kepler was at 2/3rds of peak throughput if you actually tried to read from 3 different registers for a, b and c. If you re-used one slot, you could come close to the maximum rate. So, instead of a × b + c = d, you'd have to do something like a × b + a = d.
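In CUDA terms, the difference is roughly the one between the two sketch kernels below (kernel names are mine, purely for illustration): the first FMA reads three distinct source registers per thread, while the second re-uses one operand, which is the pattern that lets Kepler approach its peak rate.

__global__ void fma_three_sources(const float* a, const float* b,
                                  const float* c, float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fmaf(a[i], b[i], c[i]);   // a × b + c: three distinct source registers
}

__global__ void fma_reused_source(const float* a, const float* b,
                                  float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fmaf(a[i], b[i], a[i]);   // a × b + a: one source register re-used
}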
To add to the reasons why Nvidia would probably not boost in DP mode: chances are that when you're using DP you're running something scientific, which creates a more constant and, first and foremost, much longer load on the ALUs than the highly variable load of a game or a normal application.
This is actually an assumption we have been making for some time, and it is obviously becoming reality. Although we did not have official confirmation from Nvidia, several external sources confirmed that our intuition was correct and that there will be no GeForce based on the first big Pascal GPU.
This is not to say that Nvidia is abandoning gamers, on the contrary! A dedicated GPU, clearly aimed at the high end, is being finalized and should be announced soon according to our information. It should make do with GDDR5 or GDDR5X and devote all of its transistors to real-time rendering.
Not sure I buy this... I mean, why bother with texturing at all if it is strictly an HPC part?
I think the non-HBM2 cards are HBM1 prototype units. I suspect that NVidia decided to "pack" the chips like that so that they could apply active cooling according to the spec for HBM2.

The weird thing is that on some P100 modules the HBM stacks are obviously smaller than on the others, and the smaller ones don't reflect the light:
[...]
We know that HBM2 is supposed to have a larger die area than HBM1, so what's the deal with the disparity here? Is Nvidia using mechanical samples with dummy filler for the missing memory stacks to cover all eight sockets for this demo unit?
The yields must truly be down the drain, if this is the case.