Nvidia BigK GK110 Kepler Speculation Thread

http://www.eetimes.com/electronics-...ers-must-demand-better-EDA-tools?pageNumber=1

In addition to trying to adhere to these rules, Alben said Nvidia has had a two-year collaboration with Synopsys around the EDA giant's VCS tool, in which they have pushed to enable GPU simulation within VCS.

In other words, they "take the design under test (DUT) and move that whole simulation over to the GPU, leaving only the test bench simulation side on the CPU side," Alben said.

The technology is a prototype, "but it's gotten to the point where we have everything we need to run our unit RTL verification environments," he said.

Nvidia runs a simulation farm with Tesla K10 accelerator cards installed, each running two DUTs. That approach speeds simulation by a factor of five, he said.
Nvidia is also collaborating with Rocketick on ATPG gate simulations. Rocketick's RocketSim software offloads calculations to a GPU to shrink verification time. Alben said Nvidia is seeing "extremely high speed up on K10 cards of 17.1x." In one case, a gate simulation shrank from 20 days to 16 hours, he added.
 
Nvidia is happy with Kepler's compute performance given the 5x to 17x speedups; they certainly wouldn't engineer Big Kepler to underperform in compute.
 
There's one trick GF110 (and GF100, for that matter) did not have up its sleeve: it did not use dual issue in/after its warp schedulers, but more fine-grained control logic, thus not relying on extracting ILP for maximum utilization. I firmly believe that quite a bit of GF110's higher performance compared to GF114 comes from this, and not all of it is attributable to higher bandwidth. As a small hint I take the results from my earlier experiment over here, where GF114 behaves a little like the HD 7970, i.e. not as "scalarly" as GF110.

In Kepler, there's no such difference.

Not all of it maybe, but still the largest part. I merely disagree with the notion many seem to present that bandwidth eventually becomes less relevant even for compute.

And since you seem to claim 58% more theoretical tris in GK110: for reference, I multiplied 4 trisetups by 1006 MHz in one case and 5 * 850 MHz in the other, so I'd love to know where exactly that 58% difference comes from. Either way, as I said, it's a rather meaningless detail, since basic geometry throughput on desktop SKUs isn't going to be anywhere near the real peaks the hw is capable of.
 
Nvidia is happy with Kepler's compute performance given the 5x to 17x speedups; they certainly wouldn't engineer Big Kepler to underperform in compute.

Point accepted; I don't expect GK110 to be weak in compute, rather the exact opposite, but that's just me.
 
Not all of it maybe, but still the largest part. I merely disagree with the notion many seem to present that bandwidth eventually becomes less relevant even for compute.
Of course bandwidth is not unimportant for compute: you need to move data into the caches fast.

Problem is, GK110 has 87.5% more compute resources but only 50% more bandwidth than GK104 (throughput-wise, i.e. per clock). So, on a per-unit basis, it does not have much of a bandwidth advantage, even if you adjust slightly for clocks (1006/1500 vs. 850/1400 MHz):
GK110: 4896 GFLOPS / 268.8 GB/s -> 18.21 FLOPS/Byte
while
GK104: 3090 GFLOPS / 192.3 GB/s -> 16.07 FLOPS/Byte

So, I'd rather see a 13% [strike]in[/strike]decrease in bandwidth per FLOP.
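
To make that arithmetic easy to check, here's a small sketch (plain host code; the unit counts and clocks are the assumptions above: a full 15-SMX GK110 at 850/1400 MHz, GK104 at 1006/1500 MHz, FMA counted as 2 FLOPs):

[code]
// Sanity check of the FLOPS/Byte figures above. Unit counts and
// clocks are assumptions from this post, not confirmed specs.
#include <cstdio>

static double flops_per_byte(int cores, double core_ghz,
                             int bus_bits, double mem_gbps) {
    double gflops = cores * 2.0 * core_ghz;       // FMA = 2 FLOPs/core/clk
    double gbs    = (bus_bits / 8.0) * mem_gbps;  // GB/s
    printf("%6.0f GFLOPS / %5.1f GB/s -> %5.2f FLOPS/Byte\n",
           gflops, gbs, gflops / gbs);
    return gflops / gbs;
}

int main() {
    flops_per_byte(2880, 0.850, 384, 5.600); // GK110: 15 SMX, 5.6 Gbps GDDR5
    flops_per_byte(1536, 1.006, 256, 6.008); // GK104: 8 SMX, ~6 Gbps GDDR5
    return 0;
}
[/code]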


And since you seem to claim 58% more theoretical tris in GK110: for reference, I multiplied 4 trisetups by 1006 MHz in one case and 5 * 850 MHz in the other, so I'd love to know where exactly that 58% difference comes from. Either way, as I said, it's a rather meaningless detail, since basic geometry throughput on desktop SKUs isn't going to be anywhere near the real peaks the hw is capable of.

1 Vertex Fetch per SMX every other cycle.
And no/yes: Geometry throughput won't be of much concern in the desktop space either way.
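
A quick sketch of where the ~58% comes from under that assumption: a full 15-SMX GK110 at 850 MHz against GK104's 4 setups (8 SMX * 0.5/clk) at 1006 MHz. The 5 * 850 MHz calculation above gives only ~6% instead:

[code]
// Theoretical triangle throughput, assuming 1 vertex fetch per SMX
// every other cycle (0.5 tris/SMX/clk) and a full 15-SMX GK110.
#include <cstdio>

int main() {
    double gk110 = 15 * 0.5 * 850;  // 6375 MTris/s
    double gk104 = 8  * 0.5 * 1006; // 4024 MTris/s (= 4 setups * 1006 MHz)
    printf("GK110/GK104 = %.2f\n", gk110 / gk104); // ~1.58, i.e. +58%
    return 0;
}
[/code]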
 
Of course bandwidth is not unimportant for compute: you need to move data into the caches fast.

Problem is, GK110 has 87.5% more compute resources but only 50% more bandwidth than GK104 (throughput-wise, i.e. per clock). So, on a per-unit basis, it does not have much of a bandwidth advantage, even if you adjust slightly for clocks (1006/1500 vs. 850/1400 MHz):
GK110: 4896 GFLOPS / 268.8 GB/s -> 18.21 FLOPS/Byte
while
GK104: 3090 GFLOPS / 192.3 GB/s -> 16.07 FLOPS/Byte

So, I'd rather see a 13% increase in bandwidth.

Under that reasoning we have for GF110 8.22 FLOPs/Byte vs. GF114 9.87 FLOPs/Byte.
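
For reference, the same arithmetic with the Fermi numbers (hot clocks, since Fermi's ALUs run at twice the core clock; GTX 580 and GTX 560 Ti reference clocks assumed):

[code]
// FLOPS/Byte for the Fermi parts, ALUs counted at hot clock.
#include <cstdio>

int main() {
    // GF110 (GTX 580): 512 cores @ 1544 MHz, 384-bit @ 4.008 Gbps
    printf("GF110: %.2f FLOPS/Byte\n", (512 * 2 * 1.544) / (48 * 4.008));
    // GF114 (GTX 560 Ti): 384 cores @ 1645 MHz, 256-bit @ 4.0 Gbps
    printf("GF114: %.2f FLOPS/Byte\n", (384 * 2 * 1.645) / (32 * 4.0));
    return 0;
}
[/code]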

1 Vertex Fetch per SMX every other cycle.
And no/yes: Geometry throughput won't be of much concern in the desktop space either way.

In pure theory for the first point, ignoring any other possible bottlenecks.
 
Under that reasoning we have for GF110 8.22 FLOPs/Byte vs. GF114 9.87 FLOPs/Byte.
Correct (please see my typo above). Less compute per byte for GF110 vs. GF114, more for GK110 vs. GK104 (fully enabled, 850/1400 MHz), alleviated to an unknown extent by the 2x L2 cache of GK110.
 
For real-time graphics, the main benefit of the coherent L2 is to support the distributed setup pipeline, and that doesn't require much capacity (GK104 is just fine with 512K out there). The other use is for global sync primitives, since NV doesn't feature a dedicated global shared memory structure.
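
For illustration, such a global sync primitive is typically built from atomics on global memory, which Kepler services through the L2. A minimal sketch, assuming a one-shot barrier and that all blocks of the grid are resident at once (illustrative only, not a production implementation):

[code]
// Software grid-wide barrier from global atomics -- the kind of
// global sync primitive a coherent L2 serves. One-shot (the counter
// is never reset) and only safe when every block is resident.
#include <cstdio>

__device__ unsigned int g_arrived = 0;

__device__ void grid_barrier(unsigned int num_blocks) {
    __syncthreads();                       // whole block arrives
    if (threadIdx.x == 0) {
        __threadfence();                   // publish this block's writes
        atomicAdd(&g_arrived, 1);          // atomic lands in L2
        while (atomicAdd(&g_arrived, 0) < num_blocks)
            ;                              // spin, also served from L2
    }
    __syncthreads();                       // release the rest of the block
}

__global__ void demo(int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;                           // phase 1
    grid_barrier(gridDim.x);               // every block sees phase 1
    if (i == 0)                            // phase 2 reads another block's write
        printf("last element: %d\n", data[gridDim.x * blockDim.x - 1]);
}

int main() {
    const int blocks = 8, threads = 128;   // few enough to co-reside
    int *d;
    cudaMalloc(&d, blocks * threads * sizeof(int));
    demo<<<blocks, threads>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
[/code]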

now, if you do some round-the-clock PT rendering on the GPU, then... :p
 
From what I read, everything gets served from memory through the L2. Coherent L2 was surely the main enabler for Nvidia's distributed setup, I concur with you on this one, but it's not its only (or main) use.
 
I've seen people with that intuition, and I totally believe GK110 will look bad too:
- it's graphics, all in FP32, no FP64
- it would be running the same code: not written for the new features
unless maybe Hyper-Q (and nothing else) can be exploited transparently with some driver magic, I don't know.
Doubled L2 is another significant factor that may help GK110 (256K per memory controller instead of 128K; with six controllers vs. four, that's 1.5 MB vs. 512 KB in total).

Otherwise, my theory is that the code is particularly optimised for Fermi, so you would need new code for Kepler to match or outperform it.
It'll still have a bit higher FP64 throughput than GK104. The rate should be 1/16 of FP32, compared to 1/24 FP32 on GK104 and 1/8 FP32 on GF110.
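
Taking the FP32 figures used earlier in the thread, those speculated rates would put FP64 throughput roughly at:

[code]
// Hypothetical FP64 throughput under the speculated rates above,
// using the FP32 figures from earlier in the thread.
#include <cstdio>

int main() {
    printf("GK110: %.1f FP64 GFLOPS (4896 / 16)\n", 4896.0 / 16); // ~306
    printf("GK104: %.1f FP64 GFLOPS (3090 / 24)\n", 3090.0 / 24); // ~129
    printf("GF110: %.1f FP64 GFLOPS (1581 /  8)\n", 1581.0 / 8);  // ~198
    return 0;
}
[/code]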
 
I don't know who the admin of redquasar is, or whether he has any dependable sources, but he is saying GeForce Titan will outperform a GTX 690: http://www.redquasar.com/forum.php?mod=viewthread&tid=11373&extra=page=1 (picked up by this website: http://wccftech.com/nvidia-geforce-titan-gtx-780-performance-surpasses-gtx-690/ )

Also, does anyone have an expreview.com account? Supposedly this post: http://bbs.expreview.com/thread-55837-1-1.html contains a pic of GeForce Titan's PCB.

OK, let's see: we have a pessimistic scenario that puts Titan only 20-30% ahead of GK104, and an optimistic one that puts it 80-90% ahead. I love simplicity, so under a KISS approach let's cut it somewhere in the middle and we might have a winner ;)
 
http://nvidianews.nvidia.com/Releas...s-World-Record-for-Energy-Efficiency-910.aspx

The Eurora supercomputer, built by Eurotech and deployed Wednesday at the Cineca facility in Bologna, Italy, the country's most powerful supercomputing center, reached 3,150 megaflops per watt of sustained performance -- a mark 26 percent higher than the top system on the most recent Green500 list of the world's most efficient supercomputers.

Larrabee/Xeon Phi's utter failure and defeat in the Green500 list is now complete.
 