Nvidia BigK GK110 Kepler Speculation Thread

Tridam · May 15, 2012

Kaotik said:
That 2880 shaders is just assuming similar SMX configuration as GK104, which can't be true.
If it's 192 with half-clockrate FP64 it's possible, but otherwise the 192 figure has to be wrong

The 192 cores SMX configuration for GK110 has been confirmed by Nvidia.

AnarchX · May 15, 2012

Who says GK110 is half-rate?

the Tesla K20 delivers 3x the double precision performance compared to the previous generation Fermi-based Tesla M2090, in the same power envelope.

[1] Based on DGEMM performance: Tesla M2090 (Fermi) = 330 gigaflops, Tesla K20 (expected) > 1000 gigaflops

http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf

PCGH reports GK110 reaches 80-85% efficiency in DGEMM: http://www.pcgameshardware.de/aid,8...Us-auf-GTC-2012-vorgestellt/Grafikkarte/News/

14 SMX * 192SPS * 2 FLOPs * 0.85 * 880MHz /4 = ~1000 GFLOPs DP
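For anyone who wants to play with the numbers, the estimate above can be written as a small Python sketch. The function name and parameters are made up for illustration; the 14 SMX, 880MHz, quarter-rate and 85% figures are the guesses from this post, not confirmed specs.

```python
def dgemm_gflops(smx, cores_per_smx, clock_ghz, dp_ratio, efficiency):
    """Sustained DP GFLOPS, counting each FMA as 2 FLOPs per core per clock."""
    peak_sp_gflops = smx * cores_per_smx * 2 * clock_ghz
    return peak_sp_gflops * dp_ratio * efficiency

# Speculated configuration from the post: 14 SMX, 192 SPs, 880MHz, 1/4 rate, 85% DGEMM efficiency.
quarter_rate = dgemm_gflops(14, 192, 0.880, 1 / 4, 0.85)
# Same chip at half rate, for comparison.
half_rate = dgemm_gflops(14, 192, 0.880, 1 / 2, 0.85)

print(round(quarter_rate), round(half_rate))
```

With these inputs the quarter-rate case lands just above the 1000 gigaflops DGEMM figure Nvidia quotes, while half rate would give roughly twice that.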

Kaotik · May 15, 2012

AnarchX said:
Who says GK110 is half-rate?

http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf

PCGH reports GK110 reaches 80-85% efficiency in DGEMM: http://www.pcgameshardware.de/aid,8...Us-auf-GTC-2012-vorgestellt/Grafikkarte/News/

14 SMX * 192SPS * 0.85 * 880MHz /4 = ~1000 GFLOPs DP

Half, quarter, whatever it is, it can't be the same configuration as GK104, since GK104 is 192+8, not 192 cores capable of 1/24 rate, or 192 of which 8 are full-rate FP64 capable.

OpenGL guy · May 15, 2012

AnarchX said:
Who says GK110 is half-rate?

http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf

PCGH reports GK110 reaches 80-85% efficiency in DGEMM: http://www.pcgameshardware.de/aid,8...Us-auf-GTC-2012-vorgestellt/Grafikkarte/News/

14 SMX * 192SPS * 0.85 * 880MHz /4 = ~1000 GFLOPs DP

The M2090 is rated at 665 GFLOPS double precision. If it's only achieving around half of that in DGEMM it would be a remarkable improvement if Kepler doubled efficiency in that case.
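A quick sketch of that efficiency comparison, using only figures quoted in this thread (665 GFLOPS rated and 330 GFLOPS DGEMM for the M2090, more than 1000 GFLOPS DGEMM and PCGH's 85% for K20). The variable names are just for illustration, and the K20 peak is an implied value, not a published one.

```python
# Figures quoted in the thread.
m2090_peak_gflops = 665.0    # rated DP peak
m2090_dgemm_gflops = 330.0   # Nvidia's DGEMM figure
k20_dgemm_gflops = 1000.0    # Nvidia's "> 1000 gigaflops" lower bound
k20_efficiency = 0.85        # upper end of the range PCGH reports

m2090_efficiency = m2090_dgemm_gflops / m2090_peak_gflops
efficiency_gain = k20_efficiency / m2090_efficiency
# DP peak that K20 would need if the 85% figure holds.
k20_implied_peak = k20_dgemm_gflops / k20_efficiency

print(f"M2090 DGEMM efficiency: {m2090_efficiency:.1%}")
print(f"Efficiency gain: {efficiency_gain:.2f}x")
print(f"Implied K20 DP peak: {k20_implied_peak:.0f} GFLOPS")
```

So the M2090 sits at about 50% in DGEMM, and 85% on Kepler would be roughly a 1.7x efficiency gain, with an implied DP peak a bit under 1.2 TFLOPS.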

AnarchX · May 15, 2012

With half-rate they could market a much higher gain over Fermi.
Maybe it is 1/3 of the SP rate per SMX, with only 4 of the 6 ALUs processing it: 64 DP-FMA per SMX.

CarstenS · May 15, 2012

AnarchX said:
Who says GK110 is half-rate?

http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf

PCGH reports GK110 reaches 80-85% efficiency in DGEMM: http://www.pcgameshardware.de/aid,8...Us-auf-GTC-2012-vorgestellt/Grafikkarte/News/

14 SMX * 192SPS * 0.85 * 880MHz /4 = ~1000 GFLOPs DP

You forgot that every FMA/MAD counts as 2 FLOPS, so it'd be half-rate again.

AnarchX · May 15, 2012

Yeah, but the result includes the 2 FLOPs per SP.

But I think 1/3 seems more likely. It would be processed as on GF100/110; the third, superscalar-issued ALU would be left without work.
So a bit over 1 TFLOPs DP could be reached with only ~700MHz to hold power down.
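To check the ~700MHz guess, here is a rough sketch that solves for the clock needed to hit a DP target. It assumes 14 active SMX and 64 DP-FMA per SMX as speculated above; the function name and the optional efficiency parameter are hypothetical.

```python
def required_clock_mhz(target_gflops, smx, dp_fma_per_smx, efficiency=1.0):
    """Clock (MHz) needed to reach target_gflops, counting each DP-FMA as 2 FLOPs."""
    flops_per_clock = smx * dp_fma_per_smx * 2
    return target_gflops / (flops_per_clock * efficiency) * 1000.0

# Clock needed for 1 TFLOPS of DP peak with 14 SMX at 64 DP-FMA each.
peak_clock = required_clock_mhz(1000, 14, 64)
# Clock needed for 1 TFLOPS of sustained DGEMM at 85% efficiency.
dgemm_clock = required_clock_mhz(1000, 14, 64, efficiency=0.85)
# DP peak at a flat 700MHz.
peak_at_700 = 14 * 64 * 2 * 0.700

print(round(peak_clock), round(dgemm_clock), round(peak_at_700, 1))
```

Under those assumptions about 560MHz is enough for 1 TFLOPS peak and about 660MHz for 1 TFLOPS in DGEMM, so ~700MHz leaves a little headroom.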

silent_guy · May 16, 2012

Blazkowicz said:
[nice that you find something like Xen to be "big iron", I know of a nice installation of it on a pentium E2200 with 2GB ram, running 5 VMs with almost no down time for a few years ]

Big iron, as in, not my MacBook Air 11" running MS Word.

A1xLLcqAgt0qc2RyMz0y · May 16, 2012

GK110 White paper

Has a white paper been released for the GK110?

lanek · May 16, 2012

This PDF has been posted, I don't remember where, but that's all I have seen.
http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf

( lol sorry, one page ago here too )

dkanter · May 16, 2012

Blazkowicz said:
it does have ECC, ECC on Tesla or Quadro is just a software trick. but that means there's ECC in the L1 and L2 so they planned for the eventuality.

DP is there, it's just slow. I believe even the low end geforces support DP. but you can read about it there yourself
http://www.brightsideofnews.com/new...esla-card3b-8gb-ecc-gddr52c-weak-dp-rate.aspx

basically, as even hugely fast FP32 can be useful some industries appeared to say "shut up and take my money".

No offense, but you should get your facts straight:

1. GK104 could have ECC on DRAM
2. GK104 *does not* have ECC on the register files, L1 or L2, go check NV's website

#2 is actually a very significant problem. Soft errors are much more problematic for on-chip SRAM than for DRAM. So having ECC on the DRAM really is just a marketing ploy.

Also, I'd point out that GK104 still sucks for general purpose workloads. If you talk to anyone at Nvidia with half a brain and a shred of honesty, they will readily admit that for quite a few workloads, Fermi is better than GK104.

GK110 is meant for computing, GK104 isn't.

DK

dkanter · May 16, 2012

fellix said:
But still 6 setup pipes means 48 fragments scan-out capacity, that nicely matches the ROP throughput.

Very nice work : )

DK

Ailuros · May 16, 2012

fellix said:
But still 6 setup pipes means 48 fragments scan-out capacity, that nicely matches the ROP throughput.

Anyway:

If true then it's a pretty weird layout; 6 GPCs with 15 SMX?

fellix · May 16, 2012

Well, it's all guesswork at this point, without official spec information. If nothing else, the two additional setup pipes are a welcome increase in the fragment output capacity, considering the vastly improved shader throughput presented by all 15 chubby multiprocessors.

Kaotik · May 16, 2012

Tridam said:
The 192 cores SMX configuration for GK110 has been confirmed by Nvidia.

Just like 192 was confirmed for GK104, while it was really 192+8? Is this also 192+something, or just 192 of which x can do FP64?

Man from Atlantis · May 16, 2012

http://www.abload.de/img/0015bdxa.jpg

Ailuros · May 16, 2012

Kaotik said:
Just like 192 was confirmed for GK104, while it was really 192+8? Is this also 192+something, or just 192 of which x can do FP64?

I must have been missing something, but why exactly is it 192+8 on GK104?

Each scheduler has its own registers (4096 x 32 bits) and its own group of four texturing units (each with its own little dedicated cache) and can issue two instructions per cycle but must share resources at this level with a second scheduler:

- SIMD0 32-way unit (the “cores”): 32 FMA FP32 or 4 FMA FP64
- SIMD1 32-way unit (the “cores”): 32 FMA FP32
- SIMD2 32-way unit (the “cores”): 32 FMA FP32
- SFU 16-way unit: 16 FP32 special functions or 32 interpolations
- Load/Store 16-way 64-bit unit

http://www.behardware.com/articles/857-2/review-nvidia-geforce-gtx-680.html
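Tallying that list per SMX, as a rough sketch: the per-unit lane counts are taken from the quote, while the two scheduler pairs per SMX (four schedulers total) is my assumption about how the pairs add up. The dictionary and variable names are only for illustration.

```python
from fractions import Fraction

# (FP32 FMA lanes, FP64 FMA lanes) for each SIMD shared by one scheduler pair,
# as listed in the BeHardware description.
PAIR_UNITS = {
    "SIMD0": (32, 4),
    "SIMD1": (32, 0),
    "SIMD2": (32, 0),
}
# Assumption: an SMX holds four schedulers, i.e. two such pairs.
SCHEDULER_PAIRS_PER_SMX = 2

fp32_per_smx = SCHEDULER_PAIRS_PER_SMX * sum(fp32 for fp32, _ in PAIR_UNITS.values())
fp64_per_smx = SCHEDULER_PAIRS_PER_SMX * sum(fp64 for _, fp64 in PAIR_UNITS.values())
fp64_ratio = Fraction(fp64_per_smx, fp32_per_smx)

print(fp32_per_smx, fp64_per_smx, fp64_ratio)  # 192 8 1/24
```

In this reading the 8 FP64 lanes live inside SIMD0 rather than being extra units, so it comes out as plain 192 with a 1/24 FP64 rate, not 192+8.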

Ailuros · May 16, 2012

fellix said:
Well, it's all guesswork at this point, without official spec information. If nothing else, the two additional setup pipes are a welcome increase in the fragment output capacity, considering the vastly improved shader throughput presented by all 15 chubby multiprocessors.

It won't take long until the funky go merry rumors appear that the chip actually has N clusters, yet they hammered out X of them for 15 to remain only in the end

Kaotik · May 16, 2012

Ailuros said:
I must have been missing something, but why exactly is it 192+8 on GK104?

http://www.behardware.com/articles/857-2/review-nvidia-geforce-gtx-680.html

And according to several other sites like http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2

• 32 CUDA cores (#1)
• 32 CUDA cores (#2)
• 32 CUDA cores (#3)
• 32 CUDA cores (#4)
• 32 CUDA cores (#5)
• 32 CUDA cores (#6)
• 16 Load/Store Units (#1)
• 16 Load/Store Units (#2)
• 16 Interpolation SFUs (#1)
• 16 Interpolation SFUs (#2)
• 16 Special Function SFUs (#1)
• 16 Special Function SFUs (#2)
• 8 Texture Units (#1)
• 8 Texture Units (#2)
• 8 CUDA FP64 cores

The other change coming from GF114 is the mysterious block #15, the CUDA FP64 block. In order to conserve die space while still offering FP64 capabilities on GF114, NVIDIA only made one of the three CUDA core blocks FP64 capable. In turn that block of CUDA cores could execute FP64 instructions at a rate of ¼ FP32 performance, which gave the SM a total FP64 throughput rate of 1/12th FP32. In GK104 none of the regular CUDA core blocks are FP64 capable; in its place we have what we’re calling the CUDA FP64 block.
The CUDA FP64 block contains 8 special CUDA cores that are not part of the general CUDA core count and are not in any of NVIDIA’s diagrams. These CUDA cores can only do and are only used for FP64 math. What's more, the CUDA FP64 block has a very special execution rate: 1/1 FP32. With only 8 CUDA cores in this block it takes NVIDIA 4 cycles to execute a whole warp, but each quarter of the warp is done at full speed as opposed to ½, ¼, or any other fractional speed that previous architectures have operated at. Altogether GK104’s FP64 performance is very low at only 1/24 FP32 (1/6 * ¼), but the mere existence of the CUDA FP64 block is quite interesting because it’s the very first time we’ve seen 1/1 FP32 execution speed. Big Kepler may not end up resembling GK104, but if it does then it may be an extremely potent FP64 processor if it’s built out of CUDA FP64 blocks.
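The ratios in that AnandTech excerpt are easy to reproduce. A small sketch follows; the helper name is made up, and the last line is purely hypothetical, showing what 64 full-rate FP64 cores per SMX would give, not anything Nvidia has stated.

```python
from fractions import Fraction

WARP_SIZE = 32

def fp64_ratio(fp64_lanes, fp32_lanes, rate=Fraction(1, 1)):
    """FP64:FP32 throughput ratio for fp64_lanes running at the given per-lane rate."""
    return Fraction(fp64_lanes, fp32_lanes) * rate

# GF114: one of three CUDA core blocks is FP64 capable, at 1/4 rate.
gf114 = fp64_ratio(1, 3, Fraction(1, 4))
# GK104: 8 dedicated full-rate FP64 cores next to 192 FP32 cores.
gk104 = fp64_ratio(8, 192)
# Cycles for the 8-wide FP64 block to work through one warp.
warp_cycles = WARP_SIZE // 8
# Hypothetical Big Kepler SMX with 64 full-rate FP64 cores.
hypothetical_gk110 = fp64_ratio(64, 192)

print(gf114, gk104, warp_cycles, hypothetical_gk110)  # 1/12 1/24 4 1/3
```

The 1/24 matches the (1/6 * ¼) in the article: 8 cores is a quarter of one 32-wide block, and one block is a sixth of the SMX.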

3dilettante · May 16, 2012

dkanter said:
No offense, but you should get your facts straight:

1. GK104 could have ECC on DRAM
2. GK104 *does not* have ECC on the register files, L1 or L2, go check NV's website

#2 is actually a very significant problem. Soft errors are much more problematic for on-chip SRAM than for DRAM. So having ECC on the DRAM really is just a marketing ploy.

Also, I'd point out that GK104 still sucks for general purpose workloads. If you talk to anyone at Nvidia with half a brain and a shred of honesty, they will readily admit that for quite a few workloads, Fermi is better than GK104.

GK110 is meant for computing, GK104 isn't.

DK

The omission of ECC on-die was made more obvious when Nvidia listed the feature set of the GK110 against GK104. There must be a subset of HPC that can tolerate transient errors at the level of errors to be expected of gamer GPU SRAM, which may not be held to the same error rates a chip like Opteron would.

Is this a reactionary move to guard big Kepler's underside from Tahiti or its successor?

A fair number of the features outlined are understandably included in GCN's roadmap for now, soon, or very soon, though it could be just one of a number of instances where AMD gets to its own starting line later.
