Nvidia BigK GK110 Kepler Speculation Thread

So is it really feasible that they can do "almost 2x GK104" in twice the transistors, with added caches, added FP64 speed etc.? I really don't see that happening
 
As Alexko among others already noted, there's cache missing compared to GF110; but in that regard GK104 doesn't differ significantly from GF1x4. GK104 does have the downside that it packs a crapload more SPs into each cluster than GF104 (192 vs. 48), if you think of it in terms of cache per ALU.
The loss of the hotclock means that's effectively just a 1:2 ratio in ALU throughput. And the L2 cache bandwidth as well as the speed for atomics actually went up with GK104 compared to Fermi (bandwidth about +70% compared to GF110, about +150% compared to GF114; atomics quite a bit more). I don't think the culprit can be found in the L2 being 33% smaller.
At the same time, the requirements of DirectCompute CS5.0 and also OpenCL limit the usable size of the L1 cache to 16kB for Fermi, because it can't be partitioned 32kB : 32kB, while GK104 allows that. That enables twice the L1 cache per SM(X) (32kB instead of 16kB), reflecting the doubling of compute throughput per clock (but I don't know if it is actually used, or if GK104 defaults to a larger shared memory region, considering it has to be shared by more threads).
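For reference, this partitioning is what CUDA exposes through cudaFuncSetCacheConfig(); Fermi only offers the 16kB/48kB and 48kB/16kB splits, while Kepler adds cudaFuncCachePreferEqual for the even 32kB/32kB split. A minimal sketch (the kernel itself is just a placeholder):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel that would benefit from the larger 32kB L1 on Kepler.
__global__ void streamKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;
}

int main()
{
    // Fermi: only 16kB L1 / 48kB shared or 48kB L1 / 16kB shared.
    // Kepler (GK104/GK110): additionally allows an even 32kB / 32kB split.
    cudaFuncSetCacheConfig(streamKernel, cudaFuncCachePreferEqual);

    // ... allocate device buffers and launch streamKernel as usual ...
    return 0;
}
```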

The last point is a significant regression for compute (they should have doubled the L1/shared memory to 128kB, GCN has 64kB dedicated shared memory [+16kB L1] for just 64 ALUs, not 48kB max for 192 ALUs, GK110 will probably do it). And connected to that, the L/S capabilities are about halved per ALU (but overall still a plus considering the clock speed). Then you have of course the much more static scheduling reducing performance (but that shouldn't have that much of an effect).
Another thing which comes to my mind is the quite low throughput for some instructions (integer adds and logic operations are fine [albeit not full rate], integer multiplies/MADs are somewhat okay with the 1:6 ratio). But just having a loop counter in a normal-sized to compact loop probably slows down GK104 considerably (it has only a 1:24 ratio for integer compares! it was 1:2 or 1:3 with Fermi). So it's not only the shift performance which is lacking in GK104.
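To make the loop-counter point concrete, here is a minimal sketch of the kind of compact loop meant above; the per-iteration compare against the trip count runs at the slow 1:24 rate on GK104, so with a short loop body it becomes a noticeable share of the issued instructions (the kernel is purely illustrative):

```cuda
// Illustrative only: a compact loop where the per-iteration compare
// (i < iters) and increment are a non-trivial share of the instructions.
__global__ void compactLoop(float* data, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = data[tid];
    for (int i = 0; i < iters; ++i)      // integer compare every iteration
        acc = acc * 1.000001f + 0.5f;    // short FP body (a single FMA)
    data[tid] = acc;
}
```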
 
The last point is a significant regression for compute (they should have doubled the L1/shared memory to 128kB, GCN has 64kB dedicated shared memory [+16kB L1] for just 64 ALUs, not 48kB max for 192 ALUs, GK110 will probably do it). And connected to that, the L/S capabilities are about halved per ALU (but overall still a plus considering the clock speed). Then you have of course the much more static scheduling reducing performance (but that shouldn't have that much of an effect).
Isn't the LDS size [per thread-block] limited by the API-mandated exposure, anyway? I'm not sure for OCL, but DC5.0 requires a fixed size of 32KBytes. Does OCL set only a lower limit?

Of course, there's always a way to increase the aggregate LDS size by simply stuffing more multiprocessors in the GPU, without hitting those API limitations.
 
Isn't the LDS size [per thread-block] limited by the API-mandated exposure, anyway? I'm not sure for OCL, but DC5.0 requires a fixed size of 32KBytes. Does OCL set only a lower limit?
OpenCL requires a minimum of 32KB of LDS. Nvidia exposes 48KB in OpenCL.
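For comparison, the corresponding per-block limit on the CUDA side can be read from the device properties; on Fermi and GK104 class hardware it reports 48KB (quick check, assuming device 0):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    // Prints 49152 (48KB) on Fermi / GK104 class hardware.
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```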
 
Isn't the LDS size [per thread-block] limited by the API-mandated exposure, anyway? I'm not sure for OCL, but DC5.0 requires a fixed size of 32KBytes. Does OCL set only a lower limit?
Does it matter?
What I was hinting at: what happens if you set a work group / thread block size of 32 and use the full 32kB per work group? Only a single warp is going to run on the whole SMX!
Of course, that's a rather extreme example. But generally, the total size of the local memory may limit the number of work groups which can be scheduled to run simultaneously on a CU (at least when using a bit more local memory), which means it limits the number of threads in flight. IIRC, GCN exposes only 32kB of local memory as the maximum allocation per work group (and not the full 64kB), which means at least two work groups can always run in parallel on each CU. The local memory size per work group is normally not a serious limitation.
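In CUDA terms the extreme case above looks like this: with 32kB of static shared memory per block, only one block fits into GK104's 48kB per SMX, so a 32-thread block leaves a single resident warp on the whole SMX (a sketch only, occupancy reasoning assumed from the 48kB limit):

```cuda
// Extreme case: 32kB of shared memory requested by a 32-thread block.
// With 48kB of shared memory per SMX only one such block (= one warp)
// can be resident per SMX at a time.
__global__ void oneWarpPerSMX(float* out)
{
    __shared__ float scratch[8192];                // 8192 * 4B = 32kB
    int tid = threadIdx.x;                         // 0..31
    scratch[tid] = (float)tid;
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = scratch[tid];
}

// Launched with a block size of 32:
//   oneWarpPerSMX<<<numBlocks, 32>>>(d_out);
```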
 
So, 690 wasn't BigK, it was dual 680. How utterly boring:

"In total, the GTX 690 has 3,072 CUDA cores, running at a 915MHz base clock and 1,019MHz boost clock, both slightly reduced from the standard values on a standalone GTX 680 graphics card. The memory clock is unchanged at an effective rate of 6GHz, and you get two batches of 2GB of GDDR5 RAM, totalling 4GB of dedicated video buffer. $999"

So, 100MHz less and the same price as 2x 680s.
 
Yep, pretty boring all in all. Looks like we'll have to wait a while for BigK. That's if it's going to be a consumer-level gaming GPU anyway.
 
The question is: if 690s are $1000, and one 580mm^2 die costs more than two 290mm^2 dies due to yields, and a 512-bit interface means 4GB, the same amount as on a 690, then... does "BigK" end up costing $1000 or more?
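A back-of-the-envelope of why one ~580mm^2 die can cost more than two ~290mm^2 dies: you get fewer than half as many candidates per wafer once edge loss is counted, and yield drops steeply with area under a simple Poisson defect model. All numbers below (wafer cost, defect density) are assumptions for illustration, not actual TSMC figures:

```cuda
#include <cmath>
#include <cstdio>

// Rough gross-dies-per-wafer approximation (area term minus edge loss).
double grossDies(double waferDiamMM, double dieAreaMM2)
{
    const double pi = 3.14159265358979;
    double r = waferDiamMM / 2.0;
    return pi * r * r / dieAreaMM2
         - pi * waferDiamMM / std::sqrt(2.0 * dieAreaMM2);
}

int main()
{
    const double waferCost = 5000.0;            // assumed 300mm wafer cost, USD
    const double d0        = 0.4;               // assumed defects per cm^2
    const double areas[]   = { 290.0, 580.0 };  // roughly GK104 vs. "BigK"

    for (int i = 0; i < 2; ++i)
    {
        double a     = areas[i];
        double gross = grossDies(300.0, a);
        double yield = std::exp(-d0 * a / 100.0);   // Poisson yield model
        printf("%4.0f mm^2: %5.1f gross dies, yield %4.1f%%, ~$%.0f per good die\n",
               a, gross, 100.0 * yield, waferCost / (gross * yield));
    }
    return 0;
}
```

Under these (made-up) inputs the big die comes out at several times the cost of two small ones, which is the point of the question above.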
 
The big chip would be targeting compute and professional graphics segments.
$1000 would be way too low for a top-end Quadro.
 
The big chip would be targeting compute and professional graphics segments.
$1000 would be way too low for a top-end Quadro.

So you think it's a completely professional/compute product with no consumer desktop component? That was my original thought. Though the buzz seems to be "wait for BigK, it's going to blow away the 680" etc...
 
From how it's described, it seems like it is going to be tailored to fit those markets.
It might still fit into a high-end enthusiast single-GPU product as long as it doesn't lose to a single 680.
The transistor count should give it plenty to work with, and Nvidia has left the upper TDP range empty, which in these power-limited scenarios means a chip running in that range should be able to win.

The 690 card does mean that the big chip may not have the top gaming bracket.
 
I wonder if NV will try a little experiment: release it as a Quadro-only product and see if the high-end gamers buy it.
 
So, 690 wasn't BigK, it was dual 680. How utterly boring:

Pretty much exactly what I expected though. BigK is unlikely to show up before the 7xx series which is probably slated for the fall or winter quarter.

So 680 will be the top single chip solution while 690 will be the top card solution for the 6xx line.

Heck, it wouldn't even surprise me if BigK was relegated to the ultra-enthusiast (~1000 USD) segment when it launches in the 7xx series, with the chip's focus being on prosumer/professional/HPC markets; the consumer space would just be there for inventory bleed-off and/or salvage parts. With that, Nvidia would use smaller dies tailored for consumer use to fit everything from the 780 on down. I certainly wouldn't be surprised if Nvidia abandoned the big-die strategy for the consumer space.

Now to see what Nvidia comes up with in the lower segments.

Regards,
SB
 
I wonder if NV will try a little experiment: release it as a Quadro-only product and see if the high-end gamers buy it.
They would have to release GeForce drivers for it. The Quadro drivers aren't exactly performant (never mind the update schedule).
 
Heck, it wouldn't even surprise me if BigK was relegated to the ultra-enthusiast (~1000 USD) segment when it launches in the 7xx series, with the chip's focus being on prosumer/professional/HPC markets; the consumer space would just be there for inventory bleed-off and/or salvage parts. With that, Nvidia would use smaller dies tailored for consumer use to fit everything from the 780 on down. I certainly wouldn't be surprised if Nvidia abandoned the big-die strategy for the consumer space.

GK110 doesn't sound like it'll appear for desktop all that soon. If by that time 28nm yields/capacities, and by extension manufacturing costs, haven't normalized, it won't be good news for either AMD or NVIDIA as far as desktop sales go (well, it'll most likely be high margins, low volume).

As for NV abandoning the big-die strategy in some form for desktop, I wouldn't be much surprised either in the long run, but for the time being it doesn't seem likely that professional market sales (despite the big margins) can absorb the R&D expenses for such a high-complexity chip on their own.
 
I'm not sure if this has been pointed out before in the long Kepler thread… but someone at the SemiAccurate forums noted that the Kepler GPUs for the Oak Ridge upgrade will have 6 GB memory. That seems to indicate the GK110 will have either a 384-bit bus or a 512-bit bus that's disabled to 384-bit on the particular cards they'll use.
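The arithmetic behind that inference (assuming standard 32-bit GDDR5 devices with power-of-two capacities, 2Gbit chips here):

```cuda
#include <cstdio>

int main()
{
    // Assumed: 2Gbit (256MB) GDDR5 devices, 32-bit interface per chip,
    // optionally doubled up in clamshell mode.
    const int chipMB = 256, chipBits = 32;

    for (int bus = 256; bus <= 512; bus += 64)
    {
        int chips = bus / chipBits;
        printf("%3d-bit bus: %2d chips -> %4d MB (%4d MB clamshell)\n",
               bus, chips, chips * chipMB, 2 * chips * chipMB);
    }
    // Only the 384-bit row lands on 3GB / 6GB; 256-bit gives 2/4GB
    // and 512-bit gives 4/8GB, so 6GB points to a 384-bit bus.
    return 0;
}
```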
 
I'm not sure if this has been pointed out before in the long Kepler thread… but someone at the SemiAccurate forums noted that the Kepler GPUs for the Oak Ridge upgrade will have 6 GB memory. That seems to indicate the GK110 will have either a 384-bit bus or a 512-bit bus that's disabled to 384-bit on the particular cards they'll use.

Nice find. I doubt they would saddle such a high profile deployment with salvage chips so maybe it's 384-bit.

GK110 doesn't sound like it'll appear for desktop all that soon.

Damn them all to hell.
 
Meh, I imagine they could sell them at $799 and still make decent coin... The only reason I can see for delaying consumer availability would be supply constraints (which is probably an issue); why sell it for $799 when you can sell the same chip for much, much more?
 