Nvidia BigK GK110 Kepler Speculation Thread

The Titan is 33% faster than a 680, and 25-27% faster than a 7970 GHz. If this card comes in 10-15% below Titan, it won't leave much margin for a 680 refresh.

I thought the performance deltas were a bit higher than 33% and 25~27% vs. the GTX 680 and 7970 GHz Edition respectively?
 
The Titan is 33% faster than a 680, and 25-27% faster than a 7970 GHz. If this card comes in 10-15% below Titan, it won't leave much margin for a 680 refresh.
And at such a high price... I don't see them starting a GTX 700 series with performance close to the 680 at a $550 price point.

Most reviews show Titan to be 45%+ faster than the GTX 680, and 33%+ faster than the 7970GE.

Anandtech probably has the fairest review, and it even has several AMD-friendly games in its benchmarking suite; it shows Titan 34% faster on average: http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled
HardOCP also shows Titan 28-43% faster in FC3, 37-40% faster in Hitman, 8% faster in Sleeping Dogs, 25-45% faster in Max Payne, and 47% faster in BF3: http://www.hardocp.com/article/2013/02/21/nvidia_geforce_gtx_titan_video_card_review/#.UUkGBRysh8E
Techpowerup shows Titan is 32% faster at 2560x1600: http://www.techpowerup.com/reviews/NVIDIA/GeForce_GTX_Titan/27.html

Your 25% figure is definitely lowballing Titan's performance vs. the 7970GE.
 
Most reviews show Titan to be 45%+ faster than the GTX 680, and 33%+ faster than the 7970GE....
Your 25% figure is definitely lowballing Titan's performance vs. the 7970GE.
33% for the 7970GE (at 1920x1200/1080 and 2560x1440/1600) is pretty much on the money, going by the 28 site reviews (taking only the single highest game IQ setting per resolution per review):

[attached chart: Titan performance aggregated from 28 reviews]
 
GK110 technology has finally made its way into smaller devices. The GPU on the Kayla dev kit for the Logan SoC has SM 3.5, dynamic parallelism and more: a D15M2-20 GPU with 2 SMX. For a mobile device, I'd say: WHOA!
http://www.pcgameshardware.de/GTC-E...bislang-geheimer-Kepler-GPU-D15M2-20-1061349/

Wow, where did this chip come from?
Alright, it's a dedicated GPU, probably what we sometimes referred to as "GK208", and Logan is a future Tegra built from Cortex-A15 cores plus Kepler... I had no idea they were going to do this :)
With that one, it looks like they intend to compete with Jaguar more than with cell-phone SoCs.

Unusually, there's already information on Wikipedia: Logan would be released in Q2 2014, and the Denver+Maxwell Tegra in 2015.
I thought Maxwell would be in "Tegra 5", but the next-gen silicon process is so incredibly hard to develop that there's enough time for an interim product, it seems.

I thought GK208 would have only 1 SMX... Or they might release a GK107 or GK208 card with 1 SMX disabled. Anyway, I'd want it for a low-power, cheap desktop Linux PC.
 
BTW, from the screenshot shown in the German article, I can tell this thing is running Ubuntu 12.04 :runaway:
The panel shown is the usual nvidia-settings, and it says Xorg 1.11 is running.

In theory, Valve could launch Steam on that platform (needing only Linux games to be ported to ARM, all other things being equal). In practice, you could at least fully use the future Tegra and its successors as a classic desktop/laptop, and it runs the same driver as on Windows x86.
 
http://www.youtube.com/watch?v=A84v7lbdcYg

http://www.youtube.com/watch?v=5d1ZOYU4gpo

Jen-Hsun's GTC 2013 keynote video is now on YouTube, for those who didn't catch the livestream.

WaveWorks and FaceWorks Demos running on TITAN.

FaceWorks is amazing. That avatar, combined with a really advanced AI routine linked into something like Google search and good voice recognition, could make your computer seem like one of those living AIs from a sci-fi movie. I want it!!
 
FaceWorks is amazing. That avatar, combined with a really advanced AI routine linked into something like Google search and good voice recognition, could make your computer seem like one of those living AIs from a sci-fi movie. I want it!!

If it can cook, I want it too!
 
FaceWorks is amazing. That avatar, combined with a really advanced AI routine linked into something like Google search and good voice recognition, could make your computer seem like one of those living AIs from a sci-fi movie. I want it!!

Yes, we all remember how awesome Matrox's HeadCasting was :D
 
Doing some non-trivial CUDA programming on GK110 now, I have to say Fermi is a very efficient design; it is much easier to achieve optimal efficiency on Fermi than on GK110.

On GK110, ILP has to be at a pretty good level to obtain maximum efficiency, and that's not an easy task for non-trivial applications.

But the reward sometimes justifies the effort: after careful tuning, CUDA code on GK110 can achieve very significant speed-ups compared with running it on a CPU or even MIC.

It really reminds me of the good old days when people programmed in machine code or assembly. Kepler is a rough, low-level system; it gives you so much more room to achieve maximum efficiency, or to mess things up.
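
To make the ILP point concrete, here is a minimal sketch of the kind of restructuring being described (the saxpy-style kernel and all names are illustrative, not from the post): each thread works on several independent elements, so GK110's schedulers always have independent instructions to pick from.

Code:
#define ILP 4  // independent elements in flight per thread (tunable)

__global__ void saxpy_ilp(int n, float a, const float *x, float *y)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ILP;
    float xr[ILP], yr[ILP];

    // None of these loads depends on the result of another, so they
    // can all be issued back to back -- the instruction-level
    // parallelism Kepler needs to approach peak throughput.
    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        if (base + i < n) { xr[i] = x[base + i]; yr[i] = y[base + i]; }

    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        if (base + i < n) y[base + i] = a * xr[i] + yr[i];
}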
 
Even Acceleware didn't really pay attention to ILP in their Kepler optimization course at GTC: in their occupancy example they used an algorithm without any ILP, thus achieving only ~2/3 of nominal throughput.
 
Sorry for the late reply - LiXiangyang, are you still focusing on FP32? My understanding is that NVIDIA basically decided to optimise GPGPU only around FP64 and doesn't care much (enough?) about Kepler's efficiency for FP32 GPGPU. In addition to the ILP requirements, one obvious example is shared memory now working in 64-bit bursts and therefore requiring vec2 accesses in FP32 mode, while it's trivial to use in FP64...
 
Not really; GK110's FP32 performance is considerably better than its FP64 performance, not just because of the raw FLOPS, but also because when FP64 is disabled GK110 can overclock itself quite a bit (in my case, under heavy load with the full-speed 64-bit mode disabled, GK110 boosts itself to around 980 MHz). As for the double bandwidth of shared memory in 64-bit mode: measured in the number of variables that can be accessed in a given cycle, FP32 bandwidth is twice that of FP64 if you use packed FP32 accesses a lot.
 
one obvious example is shared memory now working in 64-bit bursts and therefore requiring vec2 accesses in FP32 mode, while it's trivial to use in FP64...

Eh, how many half-warps are (theoretically) reading from LDS each cycle?
 
As for the double bandwidth of shared memory in 64-bit mode: measured in the number of variables that can be accessed in a given cycle, FP32 bandwidth is twice that of FP64 if you use packed FP32 accesses a lot.

Kepler has configurable shared memory banking, so you can use either 32-bit or 64-bit banks, set with cudaDeviceSetSharedMemConfig(). Hence, you can configure it for either FP32 or FP64.
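
For reference, a minimal sketch of that switch; the helper names are made up, but the runtime call and enum values are the actual CUDA ones:

Code:
#include <cuda_runtime.h>

// 8-byte banks: each double maps to a single bank, so a stride-1
// double access per thread is conflict-free -- the FP64-friendly mode.
void use_eight_byte_banks(void)
{
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
}

// Classic 4-byte banks for float-heavy kernels.
void use_four_byte_banks(void)
{
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);
}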
 
With that shared memory configuration, I think accessing an 8-byte primitive data type, or any packed data type that is 8 bytes in size, will result in doubled shared memory bandwidth; that's why I said packed FP32.
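
Something like this sketch illustrates the packed-FP32 idea (kernel and names invented for the example; assumes 256 threads per block and 8-byte banking): reading a float2 per thread moves two floats per shared-memory access instead of one.

Code:
__global__ void copy_packed(const float2 *in, float2 *out)
{
    __shared__ float2 tile[256];  // 8-byte elements match 8-byte banks

    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;

    tile[t] = in[g];    // one 64-bit shared store = two packed floats
    __syncthreads();
    out[g] = tile[t];   // one 64-bit shared load  = two packed floats
}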
 