Nvidia BigK GK110 Kepler Speculation Thread

Well, maybe Damien Triolet can shed some light on that one.

I was describing half of an SMX because 2 warp schedulers have to share half of an SMX's resources (except the registers/tex units/tex cache, which are tied to each scheduler)
 
The omission of on-die ECC was made more obvious when Nvidia listed GK110's feature set against GK104's. There must be a subset of HPC workloads that can tolerate transient errors at the rates expected of gamer GPU SRAM, which may not be held to the same error-rate standards as a chip like an Opteron.

Yes there are. But you're sort of missing my point. Soft errors in on-chip SRAM are a far bigger problem than in DRAM. There is no point in having ECC memory when your on-chip SRAMs do not have sufficient reliability. So while they might have ECC for DRAM, it's just for show.

Is this a reactionary move to guard big Kepler's underside from Tahiti or its successor?

A fair number of the features outlined are, understandably, also on GCN's roadmap for now, soon, or very soon, though this could just be one of a number of instances where AMD gets to its own starting line later.

It's really a reflection of the fact that GK104 isn't meant for compute...while GK110 is.

DK
 
Yes there are. But you're sort of missing my point. Soft errors in on-chip SRAM are a far bigger problem than in DRAM. There is no point in having ECC memory when your on-chip SRAMs do not have sufficient reliability. So while they might have ECC for DRAM, it's just for show.
It was not my intent to imply otherwise for DRAM ECC. I was reflecting on the fact that Nvidia must believe there is a sufficient market with workloads that can live without DP and error correction to justify putting together these cards.
I suppose the DRAM ECC could be a freebie checkbox, since Nvidia has little interest in designing both an ECC and a non-ECC memory controller just to keep the marketing honest.


It's really a reflection of the fact that GK104 isn't meant for compute...while GK110 is.
That's why I'm asking why Nvidia is pushing it into that market. If Nvidia's slides are proven accurate in late 2012, K10 has about 2 quarters in the shadow of its bigger cousin.
Was it pushed there to meet an upgrade cycle, or to put something in the way of Tahiti-based Firestream cards, which could come out in the meantime?
The downside is that Nvidia's hobbled compute card can nibble at the shins of the market that could have or should have been served by GK110, making the big fish's pond a little smaller.
 
Btw, in a (German) c't interview with two nV guys, they basically insisted that separate FP64 units exist in GK110.
c't: Does Nvidia use dedicated double-precision units in GK110, or do several single-precision cores work together for that?

Danksin: There are dedicated double-precision units.

c't: Do these take up a lot of space?

Alben: At least not a little. Because of that, a GK110 SMX occupies noticeably more die area than a GK104 SMX. Another space eater is the implementation of the ECC functionality.
They also said that the ECC implementation costs quite a bit of space.

Edit: Google Translate inverts the meaning of the first sentence of the second answer in the original German. ;)
 
^^ Bing works just fine :)

The separate SP and DP execution units may mean the same level of efficiency for gaming SP loads as GK104. It seems NV kept the simple SP EUs and just added 960 DP EUs. GF110's EUs are more efficient than GF114's, so if the situation were still like the Fermi era I would have guessed more than 2x the performance of GK104 at the same clocks, but now it's just ~80%, and after factoring in the lower clock speed due to thermals it should be ~50%.
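For a rough sanity check on those numbers (using the commonly rumoured counts, not confirmed specs): 960 DP units implies a full 15-SMX GK110, i.e. 15 × 192 = 2880 SP units against GK104's 8 × 192 = 1536, or 1.875x at equal clocks - close to the ~80% above. If thermals force the clock down to roughly 80% of GK104's, the advantage shrinks to about 1.875 × 0.8 ≈ 1.5x, which is where the ~50% estimate comes from.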
 
Yes there are. But you're sort of missing my point. Soft errors in on-chip SRAM are a far bigger problem than in DRAM. There is no point in having ECC memory when your on-chip SRAMs do not have sufficient reliability. So while they might have ECC for DRAM, it's just for show.



It's really a reflection of the fact that GK104 isn't meant for compute...while GK110 is.

DK

I’d say that

if you don’t need ECC or DP (i.e. the workload / problem / math doesn’t require it)
then you don’t need a Tesla-branded board.

You can get by with an ordinary/gaming card like the 690.

If you need some sort of special/official support, OK, you may buy just one T10…
And then buy dozens of gaming cards to implement the solution.
 
I know I'm late to the party, and this was posted a while back. Can you share any more info on your app? I've been doing some tinkering with Microsoft Hyper-V and RemoteFX platforms, specifically around the "shared" 3D acceleration that RemoteFX can deliver across multiple simultaneous streams via any 3D accelerator card that has a WDDM driver available. Even an ATI 4850 provides an acceptable amount of performance for 'easy' games across three separate Win7 VM remote sessions.

We have a 3D editor/render program (i.e. a regular application, not a service) that can also act as a web server, so you can set up scenes, request pictures of them, etc., remotely and live from a number of clients. It has a single D3D9 context which is shared among the web clients (only one can hold the context thread and render a picture at a time).
Traditionally (i.e. on all current installations) we have been running with a local user who auto-logs-in and starts a few instances of the program (for different applications, or just to maximize throughput).
Given that we're not already totally GPU-limited (i.e. reported GPU usage of 100%), we can squeeze out more images/sec (on Radeons) when benchmarking with 2 instances (with 2+ clients per instance to hide image compression time) on the same graphics card. On the GeForces it seems the context switching eats up any potential utilization gain.
Of course we would also like to move to a proper remote/virtual context so we can avoid this ugly auto-login setup, which is not allowed to go into screen lock, requires VNC (rather than Remote Desktop) for maintenance, etc.
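For what it's worth, the "only one client can hold the context at a time" behaviour described above boils down to serialising renders behind a lock. A minimal sketch of that idea (the class and function names are mine, not from the actual application, and the D3D9 calls are stubbed out):

[CODE]
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: a single shared render context guarded by a mutex so
// only one web client renders at a time; image compression happens outside
// the lock so other clients can overlap it with the next render.
class SharedRenderer {
public:
    std::vector<unsigned char> RenderScene(int sceneId) {
        std::vector<unsigned char> image;
        {
            std::lock_guard<std::mutex> lock(contextMutex_); // one render at a time
            image = RenderWithContext(sceneId);              // would use the single D3D9 device
        }
        return Compress(image); // outside the lock: can overlap with the next client's render
    }

private:
    std::vector<unsigned char> RenderWithContext(int sceneId) {
        return std::vector<unsigned char>(1024, static_cast<unsigned char>(sceneId)); // stand-in for a rendered frame
    }
    std::vector<unsigned char> Compress(const std::vector<unsigned char>& raw) {
        return raw; // stand-in for JPEG/PNG compression
    }
    std::mutex contextMutex_;
};

int main() {
    SharedRenderer renderer;
    std::thread a([&] { renderer.RenderScene(1); });
    std::thread b([&] { renderer.RenderScene(2); });
    a.join();
    b.join();
}
[/CODE]

Compression sitting outside the lock is presumably why running 2+ clients per instance helps hide the compression time, as described above.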
 
I’d say that

if you don’t need ECC or DP (i.e. the workload / problem / math doesn’t require it)
then you don’t need a Tesla-branded board.

You can get by with an ordinary/gaming card like the 690.
And how exactly do you fit all those gaming cards into your system or rack-mount server?

Teslas have no external display connectors and thus have much better cooling, or can even be fanless when installed in servers that have built-in fans.

If you need some sort of special/official support, OK, you may buy just one T10…
And then buy dozens of gaming cards to implement the solution.
If you need the special software, it is keyed to run only on Quadro/Tesla and will not run on gaming GPUs.

Teslas/Quadros are also screened for the better parts, whereas gaming GPUs are not. Would you really want to run your financial analysis on gaming GPUs that can/will experience an occasional hiccup in the graphics pipeline, just to save a few $?
 
financial analysis? for this particular domain they'd better ask a donkey what to do (i.e. if it eats this carrot first, then buy, else sell). the results would be cheaper and more accurate.
 
financial analysis? for this particular domain they'd better ask a donkey what to do (i.e. if it eats this carrot first, then buy, else sell). the results would be cheaper and more accurate.

Well, anything that averages over 65/70% is considered good - and it would not be one of the more absurd systems I've heard of being used...

The financial projects/libraries I know can use either SP or DP - however, I wonder what the point of this Tesla card would be without on-chip ECC. Your weakest error-rate link is the error rate you have to consider, and in the financial field... you simply do not want errors.

They can probably be good for lighting, where you can accept both the error rate and the precision loss... yet for geometry such cards would not be good (AutoCAD has used the DP FPU since the start of the nineties or so).
...a marketing move against the upcoming Tahiti card?
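As a toy illustration of the SP-vs-DP point (not tied to any particular finance library): accumulating a long series of small amounts in single precision drifts visibly once the running sum dwarfs the increments, which is exactly the kind of error a long financial aggregation cannot tolerate. A minimal sketch:

[CODE]
#include <cstdio>

int main() {
    // Add one cent ten million times; the exact total is 100,000.00.
    const int n = 10 * 1000 * 1000;
    float  sumSingle = 0.0f;
    double sumDouble = 0.0;
    for (int i = 0; i < n; ++i) {
        sumSingle += 0.01f; // ~7 significant digits: rounding error grows as the sum grows
        sumDouble += 0.01;  // ~16 significant digits: error stays negligible at this scale
    }
    std::printf("single precision: %.2f\n", sumSingle);
    std::printf("double precision: %.2f\n", sumDouble);
    return 0;
}
[/CODE]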
 
I've been wondering about the command processor in GK110 that allows it to launch new jobs - could it be the ARM core that Project Denver 'promised'?
It certainly would be flexible enough for the jobs described in the GK110 whitepaper.
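Whichever block ends up implementing it, the whitepaper feature being referred to (device-side job launches, i.e. Dynamic Parallelism) looks something like this from the programmer's side. A minimal CUDA sketch, assuming a compute capability 3.5 part and compilation with "nvcc -arch=sm_35 -rdc=true -lcudadevrt"; the kernels are placeholders:

[CODE]
#include <cstdio>

// Placeholder child kernel.
__global__ void childKernel(int parentBlock) {
    if (threadIdx.x == 0)
        printf("child grid launched by parent block %d\n", parentBlock);
}

// Parent kernel: on GK110-class hardware a kernel can enqueue new work
// itself instead of going back to the CPU first.
__global__ void parentKernel() {
    if (threadIdx.x == 0)
        childKernel<<<1, 32>>>(blockIdx.x); // device-side launch
}

int main() {
    parentKernel<<<4, 32>>>();
    cudaDeviceSynchronize(); // waits for the parent grids and their children
    return 0;
}
[/CODE]

The point is simply that the chip's front end has to be able to accept new grids generated on the GPU itself, which is what makes a small programmable command processor plausible.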
 
Hmm, maybe, though it would cost a lot in a GPU architecture anyway; maybe it's better to implement it at the whole-system level instead, as AMD is doing with ARM right now to bring compatibility between x86 and ARM (especially on the memory side) so that ARM and x86 can be used in the same system. (This is on track, but no idea when the interoperability will actually be seen - maybe 2013-2014.) AMD and ARM announced less than 10 months ago that they were working together on it.

Using an ARM part, or implementing something like it, would cost a lot in area, transistors and watts, and if it is only there to launch some commands, it's better not to have it in the core sitting inactive 99% of the time. Plus it would take a lot of development to include it.
 
Using an ARM part, or implementing something like it, would cost a lot in area, transistors and watts, and if it is only there to launch some commands, it's better not to have it in the core sitting inactive 99% of the time. Plus it would take a lot of development to include it.
I don't think you realize how many chips are out there with ARMs sprinkled all over the die. Obviously not Cortex-A9s, but a simple ARM7TDMI-S goes a long way. The area and power cost is next to nothing, probably less than 0.1mm2 in 28nm. Add 16KB of RAM and you're good to go: instead of a complex hard-wired state machine that requires a lot of effort to design and verify, you can move the complexity into software, with the ability to continuously fix bugs. It's trivial to add such a thing to a chip. No 'lot of development to include it' at all, quite the contrary. It doesn't necessarily have to be an ARM, of course; it could be MIPS, Tensilica, an 8051, or the rare in-house developed microcontroller, but in all cases it comes down to replacing an FSM with something programmable for very little area.

I've worked on chips with more than 40 programmable microcontrollers, based on a common macro, with some custom hardware attached to accelerate particular instructions.
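To make the FSM-versus-microcontroller trade-off above concrete: the firmware on such an embedded core is typically little more than a command-dispatch loop over a mailbox or queue. A rough sketch - the register layout, commands and names are invented for illustration, not taken from any real GPU:

[CODE]
#include <cstdint>

// Invented example: a memory-mapped mailbox that the host (or another block)
// writes commands into, polled by the embedded microcontroller's firmware.
struct Mailbox {
    volatile uint32_t cmd;    // command opcode, 0 = empty
    volatile uint32_t arg0;
    volatile uint32_t arg1;
    volatile uint32_t status; // firmware reports completion here
};

enum Command : uint32_t { CMD_NONE = 0, CMD_LAUNCH_GRID = 1, CMD_SET_CLOCK = 2 };

// What a hard-wired FSM would otherwise do, now as firmware: the control flow
// (and its bug fixes) lives in a few KB of code and RAM instead of in gates.
void firmwareMainLoop(Mailbox* mbox) {
    for (;;) {                          // firmware runs forever
        uint32_t cmd = mbox->cmd;
        if (cmd == CMD_NONE)
            continue;                   // idle poll (or wait-for-interrupt)
        switch (cmd) {
        case CMD_LAUNCH_GRID:
            // ...program the real launch hardware using arg0/arg1...
            mbox->status = 1;
            break;
        case CMD_SET_CLOCK:
            // ...write clock/power registers...
            mbox->status = 1;
            break;
        default:
            mbox->status = 0xDEAD;      // unknown command
            break;
        }
        mbox->cmd = CMD_NONE;           // acknowledge / pop the command
    }
}
[/CODE]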
 
A1xLLcqAgt0qc2RyMz0y, if cost is a concern: white-box servers, custom made. That's the way to go for cheap performance. Anyway, as I stated above, it depends on the actual 'problem' - what you need and what you don't.

imaxx, I remember the time when I had to use math coprocessor emulation - AutoCAD 2.7, 2.9, or AutoCAD 2.something.
 
The financial projects/libraries I know can use either SP or DP - however, I wonder what the point of this Tesla card would be without on-chip ECC.

The question is, do we know it has no on-chip ECC?
I'd wager it has. It's non-trivial but not hugely expensive, the R&D is already sunk, and what you don't get from BigK is the doubled L2 cache.

I don't believe I've read anything that said GK104 has no ECC.
 