Nvidia BigK GK110 Kepler Speculation Thread

That 2880-shader figure just assumes a similar SMX configuration to GK104, which can't be true.
If it's 192 with half-clockrate FP64 it's possible, but otherwise the 192 figure has to be wrong.

The 192 cores SMX configuration for GK110 has been confirmed by Nvidia.
 
Who says GK110 is half-rate?

The Tesla K20 delivers 3x the double precision performance compared to the previous generation Fermi-based Tesla M2090, in the same power envelope.

Based on DGEMM performance: Tesla M2090 (Fermi) = 330 gigaflops, Tesla K20 (expected) > 1000 gigaflops
http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf

PCGH reports GK110 reaches 80-85% efficiency in DGEMM: http://www.pcgameshardware.de/aid,8...Us-auf-GTC-2012-vorgestellt/Grafikkarte/News/

14 SMX * 192SPS * 2 FLOPs * 0.85 * 880MHz /4 = ~1000 GFLOPs DP
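
To make the arithmetic in the line above easy to re-run with other guesses, here is a minimal sketch. All inputs are the poster's speculation (14 active SMX, 192 lanes per SMX, an 880 MHz clock, quarter-rate FP64, 85% DGEMM efficiency), not confirmed specs, and the function name is made up for illustration.

```python
def dgemm_gflops(smx, lanes_per_smx, flops_per_lane, efficiency,
                 clock_mhz, fp64_divisor):
    """Estimated sustained DGEMM throughput in GFLOPS.

    Every argument is an assumption taken from the forum post,
    not an official figure.
    """
    # Peak FP32 rate in GFLOPS: lanes * FLOPs per lane per clock * clock in GHz
    peak_fp32 = smx * lanes_per_smx * flops_per_lane * clock_mhz / 1000.0
    # FP64 is assumed to run at 1/fp64_divisor of the FP32 rate
    peak_fp64 = peak_fp32 / fp64_divisor
    return peak_fp64 * efficiency


estimate = dgemm_gflops(smx=14, lanes_per_smx=192, flops_per_lane=2,
                        efficiency=0.85, clock_mhz=880, fp64_divisor=4)
print(round(estimate))  # roughly 1005 GFLOPS with these guesses
```

Dropping `flops_per_lane` to 1 halves the result, which is why the factor of 2 for FMA matters in this estimate.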
 
Who says GK110 is half-rate?

http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf

PCGH reports GK110 reaches 80-85% efficiency in DGEMM: http://www.pcgameshardware.de/aid,8...Us-auf-GTC-2012-vorgestellt/Grafikkarte/News/

14 SMX * 192SPS * 0.85 * 880MHz /4 = ~1000 GFLOPs DP
The M2090 is rated at 665 GFLOPS double precision. If it's only achieving around half of that in DGEMM it would be a remarkable improvement if Kepler doubled efficiency in that case.
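
As a quick check on how large that efficiency jump would be, the sketch below uses only the numbers quoted in this thread (665 GFLOPS rated and 330 GFLOPS DGEMM for the M2090, 80-85% reported for GK110). The variable names are illustrative.

```python
# Figures as quoted in the thread; none are measured here.
m2090_peak_gflops = 665.0    # rated double precision peak
m2090_dgemm_gflops = 330.0   # DGEMM figure from the Nvidia datasheet footnote

fermi_efficiency = m2090_dgemm_gflops / m2090_peak_gflops

# PCGH's reported DGEMM efficiency range for GK110
kepler_eff_low, kepler_eff_high = 0.80, 0.85

gain_low = kepler_eff_low / fermi_efficiency
gain_high = kepler_eff_high / fermi_efficiency

print(f"Fermi DGEMM efficiency: {fermi_efficiency:.1%}")
print(f"Kepler efficiency gain: {gain_low:.2f}x to {gain_high:.2f}x")
```

With these inputs the gain comes out somewhat below a full doubling, at roughly 1.6x to 1.7x.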
 
With half-rate they could market a much higher gain over Fermi.
Maybe it is 1/3 SP per SMX with only 4 of the 6 ALUs processing it, 64 DP-FMA per SMX.
 
Yeah, but the result includes the 2 FLOPs per SP.

But I think 1/3 seems more likely. It would be processed like on GF100/110; the third super-scalar executed ALU would be without work.
So a bit over 1 TFLOPs DP could be reached with only ~700MHz to hold power down.
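
The 1/3-rate scenario can be turned around to ask what clock it would need. This is a rough sketch assuming the same speculative inputs as the posts above (14 SMX, 64 DP-FMA per SMX, 85% DGEMM efficiency); the helper names are hypothetical.

```python
def required_clock_mhz(target_gflops, smx, dp_fma_per_smx, efficiency):
    """Clock (MHz) needed to sustain target_gflops in DGEMM.

    Assumes each DP FMA lane retires 2 FLOPs per clock.
    """
    flops_per_clock = smx * dp_fma_per_smx * 2
    # GFLOPS * 1000 = MFLOPS; dividing by FLOPs per clock gives MHz
    return target_gflops * 1000.0 / (flops_per_clock * efficiency)


def sustained_gflops(clock_mhz, smx, dp_fma_per_smx, efficiency):
    """Sustained DGEMM GFLOPS at a given clock, same assumptions."""
    return smx * dp_fma_per_smx * 2 * efficiency * clock_mhz / 1000.0


clock = required_clock_mhz(1000, smx=14, dp_fma_per_smx=64, efficiency=0.85)
at_700 = sustained_gflops(700, smx=14, dp_fma_per_smx=64, efficiency=0.85)
print(f"{clock:.0f} MHz for 1000 GFLOPS, {at_700:.0f} GFLOPS at 700 MHz")
```

Under these guesses about 657 MHz would be enough for 1000 GFLOPS, and 700 MHz lands at about 1066 GFLOPS, consistent with "a bit over 1 TFLOPs".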
 
Blazkowicz said:
[nice that you find something like Xen to be "big iron", I know of a nice installation of it on a pentium E2200 with 2GB ram, running 5 VMs with almost no down time for a few years :)]
Big iron, as in, not my MacBook Air 11" running MS Word. ;)
 
It does have ECC; ECC on Tesla or Quadro is just a software trick. But that means there's ECC in the L1 and L2, so they planned for the eventuality.

DP is there, it's just slow. I believe even the low-end GeForces support DP. But you can read about it there yourself:
http://www.brightsideofnews.com/new...esla-card3b-8gb-ecc-gddr52c-weak-dp-rate.aspx

Basically, as even hugely fast FP32 can be useful, some industries appeared to say "shut up and take my money".

No offense, but you should get your facts straight:

1. GK104 could have ECC on DRAM
2. GK104 *does not* have ECC on the register files, L1 or L2, go check NV's website

#2 is actually a very significant problem. Soft errors are much more problematic for on-chip SRAM than for DRAM. So having ECC on the DRAM really is just a marketing ploy.

Also, I'd point out that GK104 still sucks for general purpose workloads. If you talk to anyone at Nvidia with half a brain and a shred of honesty, they will readily admit that for quite a few workloads, Fermi is better than GK104.

GK110 is meant for computing, GK104 isn't.

DK
 
But still, 6 setup pipes means 48 fragments of scan-out capacity, which nicely matches the ROP throughput.

Anyway:

[image: mVFb6.jpg]

:LOL:

If true then it's a pretty weird layout; 6 GPCs with 15 SMX?
 
Well, it's all about a good guess at this point, without any official spec information. If nothing else, the two additional setup pipes are a welcome increase in fragment output capacity, considering the vastly improved shader throughput provided by all 15 chubby multiprocessors.
 
The 192 cores SMX configuration for GK110 has been confirmed by Nvidia.

Just like 192 was confirmed for GK104, while it was really 192+8? Is this also 192+something, or just 192, of which x can do FP64?
 
Just like 192 was confirmed for GK104, while it was really 192+8? Is this also 192+something, or just 192, of which x can do FP64?

I must have been missing something, but why exactly is it 192+8 on GK104?

Each scheduler has its own registers (4096 x 32 bits) and its own group of four texturing units (each with its own little dedicated cache) and can issue two instructions per cycle but must share resources at this level with a second scheduler:

- SIMD0 32-way unit (the “cores”): 32 FMA FP32 or 4 FMA FP64
- SIMD1 32-way unit (the “cores”): 32 FMA FP32
- SIMD2 32-way unit (the “cores”): 32 FMA FP32
- SFU 16-way unit: 16 FP32 special functions or 32 interpolations
- Load/Store 16-way 64-bit unit

http://www.behardware.com/articles/857-2/review-nvidia-geforce-gtx-680.html
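
To see where "192+8" comes from in that listing, the small tally below sums the SIMD units per scheduler pair. It assumes two scheduler pairs (four schedulers) per GK104 SMX, which the quoted text implies but does not state outright; the tuple layout is only for illustration.

```python
# Units shared by one pair of schedulers, per the behardware listing:
# (name, FP32 FMA per clock, FP64 FMA per clock)
SCHEDULER_PAIR_UNITS = [
    ("SIMD0", 32, 4),
    ("SIMD1", 32, 0),
    ("SIMD2", 32, 0),
]

# Assumption: a GK104 SMX contains two such scheduler pairs.
PAIRS_PER_SMX = 2

fp32_per_smx = PAIRS_PER_SMX * sum(fp32 for _, fp32, _ in SCHEDULER_PAIR_UNITS)
fp64_per_smx = PAIRS_PER_SMX * sum(fp64 for _, _, fp64 in SCHEDULER_PAIR_UNITS)

print(f"{fp32_per_smx}+{fp64_per_smx}")  # 192+8
```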
 
Well, it's all about a good guess at this point, without any official spec information. If nothing else, the two additional setup pipes are a welcome increase in fragment output capacity, considering the vastly improved shader throughput provided by all 15 chubby multiprocessors.

It won't take long until the funky go merry rumors appear that the chip actually has N clusters, yet they hammered out X of them for 15 to remain only in the end :devilish:
 
I must have been missing something, but why exactly is it 192+8 on GK104?

http://www.behardware.com/articles/857-2/review-nvidia-geforce-gtx-680.html

And according to several other sites like http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2
• 32 CUDA cores (#1)
• 32 CUDA cores (#2)
• 32 CUDA cores (#3)
• 32 CUDA cores (#4)
• 32 CUDA cores (#5)
• 32 CUDA cores (#6)
• 16 Load/Store Units (#1)
• 16 Load/Store Units (#2)
• 16 Interpolation SFUs (#1)
• 16 Interpolation SFUs (#2)
• 16 Special Function SFUs (#1)
• 16 Special Function SFUs (#2)
• 8 Texture Units (#1)
• 8 Texture Units (#2)
• 8 CUDA FP64 cores

The other change coming from GF114 is the mysterious block #15, the CUDA FP64 block. In order to conserve die space while still offering FP64 capabilities on GF114, NVIDIA only made one of the three CUDA core blocks FP64 capable. In turn that block of CUDA cores could execute FP64 instructions at a rate of ¼ FP32 performance, which gave the SM a total FP64 throughput rate of 1/12th FP32. In GK104 none of the regular CUDA core blocks are FP64 capable; in its place we have what we’re calling the CUDA FP64 block.
The CUDA FP64 block contains 8 special CUDA cores that are not part of the general CUDA core count and are not in any of NVIDIA’s diagrams. These CUDA cores can only do and are only used for FP64 math. What's more, the CUDA FP64 block has a very special execution rate: 1/1 FP32. With only 8 CUDA cores in this block it takes NVIDIA 4 cycles to execute a whole warp, but each quarter of the warp is done at full speed as opposed to ½, ¼, or any other fractional speed that previous architectures have operated at. Altogether GK104’s FP64 performance is very low at only 1/24 FP32 (1/6 * ¼), but the mere existence of the CUDA FP64 block is quite interesting because it’s the very first time we’ve seen 1/1 FP32 execution speed. Big Kepler may not end up resembling GK104, but if it does then it may be an extremely potent FP64 processor if it’s built out of CUDA FP64 blocks.
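
The FP64:FP32 ratios in the AnandTech excerpt can be reproduced with exact fractions. This is a sketch under the excerpt's own description; the 48-core GF114 SM made of three 16-wide blocks is an assumption about the GF114 layout that the excerpt does not spell out, and `fp64_ratio` is a hypothetical helper.

```python
from fractions import Fraction


def fp64_ratio(fp64_lanes, fp64_rate, fp32_lanes):
    """FP64 throughput as a fraction of FP32 throughput for one SM/SMX."""
    return Fraction(fp64_lanes) * fp64_rate / fp32_lanes


# GF114 (assumed 48 cores per SM in three 16-wide blocks):
# one block is FP64 capable and runs FP64 at 1/4 rate.
gf114 = fp64_ratio(fp64_lanes=16, fp64_rate=Fraction(1, 4), fp32_lanes=48)

# GK104: 8 dedicated FP64 cores at full rate next to 192 FP32 cores.
gk104 = fp64_ratio(fp64_lanes=8, fp64_rate=Fraction(1, 1), fp32_lanes=192)

print(gf114)  # 1/12
print(gk104)  # 1/24
```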
 
No offense, but you should get your facts straight:

1. GK104 could have ECC on DRAM
2. GK104 *does not* have ECC on the register files, L1 or L2, go check NV's website

#2 is actually a very significant problem. Soft errors are much more problematic for on-chip SRAM than for DRAM. So having ECC on the DRAM really is just a marketing ploy.

Also, I'd point out that GK104 still sucks for general purpose workloads. If you talk to anyone at Nvidia with half a brain and a shred of honesty, they will readily admit that for quite a few workloads, Fermi is better than GK104.

GK110 is meant for computing, GK104 isn't.

DK
The omission of on-die ECC was made more obvious when Nvidia listed the feature set of GK110 against GK104. There must be a subset of HPC that can tolerate transient errors at the rate to be expected of gamer GPU SRAM, which may not be held to the same error-rate standards as a chip like Opteron.

Is this a reactionary move to guard big Kepler's underside from Tahiti or its successor?


A fair number of the features outlined are understandably included in GCN's roadmap for now, soon, or very soon, though it could be just one of a number of instances where AMD gets to its own starting line later.
 