Nvidia BigK GK110 Kepler Speculation Thread

With that shared memory configuration, I think accessing an 8-byte primitive data type, or any packed data type that is 8 bytes in size, will result in doubled shared memory bandwidth. That's why I said packed fp32.

I think that was Arun's point. You need to pack your 32-bit LDS accesses in order to maximize bandwidth. This was not the case on prior architectures so it's an indication that single precision took a back seat on Kepler.
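To make "packing" concrete, here's a minimal sketch (my own illustration, not anyone's actual code) of fusing two fp32 values into a single 8-byte shared memory access via float2:

#include <cuda_runtime.h>

// Each thread moves one float2 (8 bytes) through shared memory, so every
// access spans a full 64-bit bank instead of two separate 32-bit ones.
// Launch with 128 threads per block to match the tile size.
__global__ void packedFp32(const float2 *in, float *out)
{
    __shared__ float2 tile[128];                 // 128 * 8 bytes
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                   // one 64-bit load per thread
    __syncthreads();
    float2 v = tile[threadIdx.x];
    out[i] = v.x + v.y;
}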
 

You do *not* need to pack your 32 bit accesses, provided that the device is in 32 bit banking mode. Now, if you were to put it into 64 bit banking mode, you would in fact have to pack it for full bandwidth, but why would you do this unless you were using a lot of double precision to begin with?
 

From what I understand, this is not true. I believe you only get full bandwidth if 64-bit banking mode is turned on.
 
Nvidia has described that this is not the case with their own memory controllers... it only applies to accessing some specific data... they do not work the same way other memory controllers do.

I don't think you're talking about the same thing here, guys. I absolutely don't understand why you're wasting your time arguing about it.
 
You do *not* need to pack your 32 bit accesses, provided that the device is in 32 bit banking mode. Now, if you were to put it into 64 bit banking mode, you would in fact have to pack it for full bandwidth, but why would you do this unless you were using a lot of double precision to begin with?

Do you have evidence to support that? nVidia mentions nothing of a 32-bit banking mode. They explicitly state that 8-byte accesses are required to maximize LDS bandwidth.

http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html

In balance with the increased computational throughput in Kepler's SMX described in Device Utilization and Occupancy, shared memory bandwidth in SMX is twice that of Fermi's SM. This bandwidth increase is exposed to the application through a configurable new 8-byte shared memory bank mode. When this mode is enabled, 64-bit (8-byte) shared memory accesses (such as loading a double-precision floating point number from shared memory) achieve twice the effective bandwidth of 32-bit (4-byte) accesses. Applications that are sensitive to shared memory bandwidth can benefit from enabling this mode as long as their kernels' accesses to shared memory are for 8-byte entities wherever possible.

@lanek - we're not talking about external memory controllers. We're talking about on-chip shared memory / LDS accesses.
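For what it's worth, the bank mode is a runtime switch on the device; here's a minimal host-side sketch using the standard CUDA runtime calls (the actual kernel launch is left out):

#include <cuda_runtime.h>

int main()
{
    // Kepler defaults to 4-byte shared memory banks; opt the device into
    // the 8-byte bank mode the tuning guide describes. (It can also be set
    // per kernel with cudaFuncSetSharedMemConfig.)
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    // Read back which mode actually took effect.
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);

    // ... launch kernels whose shared memory accesses are 8-byte entities
    // (double, float2, ...) to see the doubled bandwidth ...
    return 0;
}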
 
http://videocardz.com/41297/nvidia-geforce-gtx-770-pictured

GK110 GTX 780 photo.
 
GK208 has been released??

I've googled "nvidia kayla" and got here, where it says:
NVIDIA® GeForce® GT640/GDDR5 (TO BE PURCHASED SEPARATELY) Buy Now
https://developer.nvidia.com/content/kayla-platform

"Buy Now" link to the ASUS GT640-1GD5-L which has 1GB gddr5 on 64bit bus, 384 units.
Card is shown here with mention of the GK208. TDP is 49 watts (not that low, but not that high)
http://www.techpowerup.com/gpudb/b2003/asus-gt-640.html


So, is that the tiny variant of GK110 we've got here, with CUDA compute capability 3.5?
 
Yes, it has already been released.

http://www.amazon.com/dp/B00CZ58XMA/
 
Starting LU Decomposition (CUDA Dynamic Parallelism)
GPU Device 0: "GeForce GT 640" with compute capability 3.5

GPU device GeForce GT 640 has compute capabilities (SM 3.5)
Compute LU decomposition of a random 1024x1024 matrix using CUDA Dynamic Parallelism
Launching single task from device...
GPU perf(dgetrf)= 3.358 Gflops
Checking results... done
Tests suceeded
------------------------------------------------------------------------------
starting hyperQ...
GPU Device 0: "GeForce GT 640" with compute capability 3.5

> Detected Compute SM 3.5 hardware with 2 multi-processors
Expected time for serial execution of 32 sets of kernels is between approx. 0.330s and 0.640s
Expected time for fully concurrent execution of 32 sets of kernels is approx. 0.020s
Measured time for sample = 0.050s
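If anyone wants to verify what their card reports without building the SDK samples, here's a quick sketch with the standard device query:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("GPU Device 0: \"%s\" with compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    // Dynamic parallelism and Hyper-Q require SM 3.5 or newer, which is
    // exactly what makes a GK208-based GT 640 interesting.
    if (prop.major > 3 || (prop.major == 3 && prop.minor >= 5))
        printf("SM 3.5+: dynamic parallelism and Hyper-Q available\n");
    return 0;
}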
 
https://devtalk.nvidia.com/default/...-319-and-linux-arm-support-for-discrete-gpus/

The next release from the 319.xx driver series will introduce support for the ARM architecture on Linux.

This new package provides display driver components built using the Thumb-2 instruction set. The kernel module and CUDA driver are built using the ARMv7 instruction set. All display driver components support thumb interworking and use HardFP.

This new ARM build has feature parity with other supported architectures.
 
Titan killed any demand for Dual GPU cards

http://www.tomshardware.com/reviews/radeon-hd-7990-crossfire-overheat,3539-3.html

Incidentally, two system builders independently reported that sales of all dual-GPU cards (including GeForce GTX 690) dropped to near-nothing once GeForce GTX Titan showed up.

Demand for Titan outstripped Nvidia’s ability to produce it. So, there is a healthy market for $1000 video cards. Enthusiasts simply don’t want to spend that much on cards that behave badly—regardless of whether they come from AMD or Nvidia.
 
At the same time, it doesn't say how many Titan units Nvidia was planning to produce when it released the card.

They already need to supply Quadro and Tesla K20/K20X parts, and of course now the 780... The 7990 wasn't out yet (not that it would have sold well anyway), and the 690 has been out for a while; this market has always been a real niche, as most people will buy two GPUs for SLI or CrossFire. Those setups are still more practical than dual-GPU cards (you don't have the same limitations on power, high temperatures, etc.).

In reality I have no idea how available the 690 or even the 7990 is in shops today. These are the kind of cards shops don't necessarily keep much stock of; they don't order many because they don't really sell well.

Anyway, if they can release a card with the same (theoretical) power as 2x 7970 GHz or 2x 770... I'll take it.
 
One potential upside to an expensive single-GPU solution gutting SLI-on-a-stick is that it might add weight to the case for more scalability features or just more die area.
If they can't take a smaller chip with the usual ill-behaved scaling measures and use the AFR crutch to magic up sales, they might decide the single chip should be just a little bit better, or maybe in some other lifetime they'll think of including multi-GPU methods that don't crap the bed as often.
 
The cache increase (4x per partition, so doubled in total compared to GK107) isn't nearly enough to compensate for the pathetic memory bandwidth of the DDR3 version, though.

Still, this actually looks like a decent improvement. The 64-bit GDDR5 GK208 is easily faster than the 128-bit DDR3 GK107 (as it should be, since it has both a higher core clock and more memory bandwidth; that it has fewer ROPs certainly won't matter with that kind of bandwidth, not to mention both rasterization and shader export are limited to 8 pixels per clock anyway), while using less power (despite the higher clocks and GDDR5 - granted, 64-bit instead of 128-bit probably compensates for that). With DDR3, though, it's a bit too limited by memory bandwidth - it beats the Fermi GT 630 on shader-heavy workloads but loses quite badly on some more bandwidth-limited things. Granted, at least perf/W is miles ahead of the Fermi chip (unsurprisingly), though the difference from its GDDR5 GT 640 sibling doesn't quite seem to be as large as advertised (25W vs 49W TDP).

Also nice that you can get 2GB GDDR5 versions - they must be using four 4Gbit chips in clamshell mode.
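Quick bandwidth math, assuming the clocks TechPowerUp lists (roughly 5.0 GT/s effective for the GDDR5 card and 1.8 GT/s for the DDR3 one): 64 bits x 5.0 GT/s / 8 gives about 40 GB/s for the GDDR5 GK208, versus 128 bits x 1.8 GT/s / 8 or about 28.8 GB/s for the DDR3 GK107, which lines up with the GDDR5 card winning despite the narrower bus.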
 