Specs may look unimpressive but it delivers in the only "benchmark" they have released to date - Alexnet fits their narrative.

The specs look unimpressive for a 610 mm^2 chip. Perhaps they will get another 600 mm^2 chip for gaming purposes.
The clockspeed is pretty impressive. 2 GHz air-cooled overclocked cards incoming?
Yeah, and it's worth remembering just how quickly the performance ceiling vs. TDP/heat is reached once clocks hit their upper limits (Maxwell was good, but it lost efficiency past a certain clock, and maybe we will see the same again).

One explanation could be that Nvidia is more conservative with HPC and automotive systems.
I don't think we'll see 2 GHz air cooled, and it'll be a big stretch with forced cooling. But it's not completely outrageous.
But FinFET has a sharper curve: a higher curve point, but a sharper one, on either foundry's process. The big chip gets a higher clock versus 28nm, the smaller chips get a lower clock; that's simply the math of it. And considering it's a new process, that the first "big" GPU off it already has disabled modules to improve yields, and that the process is already more expensive (per mm^2) than the previous one, this is going to be a very expensive and possibly quite limited card.
Whatever cards there are, consumer or otherwise, will have to wait until June. And since there's no mention of price, availability, etc. in the announcement, only limited guesses can be made.
Expensive is very relative when you're really talking about the difference of a few percentage points more or less on what's probably a gross margin of 80%.
Specs may look unimpressive but it delivers in the only "benchmark" they have released to date - Alexnet fits their narrative.
As a very rough ballpark (calculated using the Titan X), the Maxwell M40 manages around 2,700 images/sec, and its Pascal replacement with cuDNN v5 manages around 4,500 images/sec.
Cheers
I do not see any issues with the Pascal card; there was a chart as part of the presentation showing that performance along with clocks, and only Pascal had boost clocks as part of the values, so it seems this is accounted for.

Hi guys. Regarding the TDP, I'm thinking that it will only actually reach those levels in DP operations. Remember that on the GK110-based Titan, enabling the 1:3 DP ratio had consequences:
"The penalty for enabling full speed FP64 mode is that NVIDIA has to reduce clockspeeds to keep everything within spec. For our sample card this manifests itself as GPU Boost being disabled, forcing our card to run at 837MHz (or lower) at all times. And while we haven't seen it first-hand, NVIDIA tells us that in particularly TDP constrained situations Titan can drop below the base clock to as low as 725MHz. This is why NVIDIA’s official compute performance figures are 4.5 TFLOPS for FP32, but only 1.3 TFLOPS for FP64. The former is calculated around the base clock speed, while the latter is calculated around the worst case clockspeed of 725MHz. The actual execution rate is still 1/3."
Source
I figure that 1:2 DP ratio must have a larger impact, but shouldn't have an impact under FP32 operation. Thoughts?
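As a rough sanity check on the figures in that quote (back-of-the-envelope only, assuming the GTX Titan's 2,688 CUDA cores, its 896 FP64 units at the 1/3 rate, and 2 FLOPs per FMA):

FP32: 2,688 ALUs × 2 FLOPs × 0.837 GHz ≈ 4.5 TFLOPS
FP64: 896 ALUs × 2 FLOPs × 0.725 GHz ≈ 1.3 TFLOPS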
I do not see any issues with the Pascal card; there was a chart as part of the presentation showing that performance along with clocks, and only Pascal had boost clocks as part of the values, so it seems this is accounted for.
Cheers
I am aware of Nvidia's reasoning for GK110's DP mode, but I didn't buy it back then.

Hi guys. Regarding the TDP, I'm thinking that it will only actually reach those levels in DP operations. Remember that on the GK110-based Titan, enabling the 1:3 DP ratio had consequences:
"The penalty for enabling full speed FP64 mode is that NVIDIA has to reduce clockspeeds to keep everything within spec. For our sample card this manifests itself as GPU Boost being disabled, forcing our card to run at 837MHz (or lower) at all times. And while we haven't seen it first-hand, NVIDIA tells us that in particularly TDP constrained situations Titan can drop below the base clock to as low as 725MHz. This is why NVIDIA’s official compute performance figures are 4.5 TFLOPS for FP32, but only 1.3 TFLOPS for FP64. The former is calculated around the base clock speed, while the latter is calculated around the worst case clockspeed of 725MHz. The actual execution rate is still 1/3."
Source
I figure that 1:2 DP ratio must have a larger impact, but shouldn't have an impact under FP32 operation. Thoughts?
I am aware of Nvidia's reasoning for GK110's DP mode, but I didn't buy it back then.
It somehow eludes me how you would use more energy with only two thirds of the calculations and data moved around. AFAIAA, the SP units could not be fed by the register file while the DP units were busy, because it was tailored to the latter's needs.
I think the Fermi Tesla SKUs had a lower clock for the ROP/L2 domain than the GeForce and Quadro units.

Fermi and Maxwell Teslas were clocked closer to their consumer counterparts.
I have no idea what the actual reason was. I found that OC indeed worked better and more reliably with a Titan in DP mode, because you could set the clock speed without boost interfering.
Probably, it was just precautionary?
It somehow eludes me how you would use more energy with only two thirds of the calculations and data moved around.
Ah sorry, I interpreted your post as meaning they will not hit their higher clock speed for FP64/FP32.

I'm not saying there are any issues. What I'm saying is that instead of operating at lower frequencies like they did on Titan, they simply increased the TDP.
I don't know, it was just more stable.
Kepler was at 2/3rds of peak throughput if you actually tried to read from 3 different registers for a, b and c. If you re-used one slot, you could come close to the maximum rate. So, instead of a × b + c = d, you'd have to do something like a × b + a = d.
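To make that concrete (a toy CUDA sketch with hypothetical kernel names, only meant to show the two operand patterns; whether the re-used slot actually recovers the lost throughput depends on how the compiler schedules the register reads, per the 2/3rds-of-peak behaviour described above):

```cuda
// d = a * b + c : three distinct source registers per FMA,
// the case described above as limited to ~2/3rds of peak on Kepler.
__global__ void fma_three_operands(const float* a, const float* b,
                                   const float* c, float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fmaf(a[i], b[i], c[i]);   // reads a, b and c
}

// d = a * b + a : the addend re-uses the first operand slot,
// so only two distinct source registers are read per FMA.
__global__ void fma_reused_operand(const float* a, const float* b,
                                   float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fmaf(a[i], b[i], a[i]);   // a appears in two operand slots
}
```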
"with only two thirds of the calculations and data moved around"
My hunch is that the actually hard part was designing the register file accordingly, and that it was specced to fully saturate the 960 DP-FP units (each with three 64-bit reads and one 64-bit write; maybe the write port could even be a shared port, since the write occurs after the read). With fully independent operands, that would suffice for two SP FMAs instead of one 64-bit FMA, thus 64 out of 192 SP ALUs running dry.

Doesn't that kind of explain this?
In both cases, only 2/3rds of the calculations (compared to the theoretical maximum) are done and 2/3rds of the data is moved around, right? Although the DP:SP ratio is supposed to be 1:3, doesn't that limitation make it more like 1:2 in practice?
How can it? It's 64 bits per operand instead of 32, but with only 960 instead of 2,880 ALUs, so 2/3rds.

And then, is the data moved around really the same? DP doesn't increase it even just a little bit? Does the 1:2 rate translate to just requiring half the number of threads or warps (excuse my ignorance), and thus potentially half the required operands, or do you in practice still require a higher number of warps to keep the execution units busy? To me it would look like, statistically, DP would prompt more memory reads, hence stalls, meaning more than half the work items are required to maintain the same occupancy. (Really sorry if I used the terms wrong; I hope the concept is still understandable.)
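To put rough numbers on that register-file argument (my own back-of-the-envelope, per GK110 SMX with 192 SP and 64 DP ALUs and three source reads per FMA):

DP at full rate: 64 FMAs/clk × 3 operands × 64 bit = 12,288 bits read per clock
SP at full rate: 192 FMAs/clk × 3 operands × 32 bit = 18,432 bits read per clock

So full-rate DP reads exactly 2/3rds of the data that full-rate SP would, and a register file specced for the DP case can feed only 12,288 / (3 × 32) = 128 SP FMAs per clock, leaving 64 of the 192 SP ALUs unfed, which matches the 2/3rds-of-peak figure mentioned above.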