Nvidia BigK GK110 Kepler Speculation Thread

Well, once again, Kepler sacrifices registers/cache for flops, and sometimes that hurts performance a good bit. If you rewrite programs to take this into account, you can achieve very good results. Example: Understanding the Efficiency of Ray Traversal on GPUs – Kepler and Fermi Addendum. Clearly, this requires extra work and may not always be possible, but Kepler does have the potential for very high GPGPU performance.
I agree with you there. But ideally you'd want something that generally gives good performance without having to rewrite things manually too much. AMD's pre-SI chips could get very good performance for the right workloads too, but were (rightly) criticized because performance for other workloads was just bad.
And as far as I know, none of these OpenCL/DirectCompute benchmarks do anything particularly stupid, so seeing Titan fall behind the "little" 7970 GE is quite disappointing, given it has a sizable raw power advantage (granted it's not _that_ big with SP).
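To make the "rewrite things to fit Kepler's register budget" point a bit more concrete, here's a minimal, hypothetical sketch (mine, not from the addendum linked above) of one knob CUDA exposes for exactly this trade-off: __launch_bounds__ tells the compiler what occupancy you're aiming for, and it will cap per-thread register usage (spilling to local memory if it has to) so that occupancy is actually reachable.

Code:
// Hypothetical example, not taken from the ray traversal addendum above.
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) gives
// nvcc an occupancy target; the compiler then limits per-thread register
// usage, spilling if necessary, so that target can be met.
__global__ void
__launch_bounds__(256, 4)   // <= 256 threads per block, aim for 4 resident blocks per SMX
saxpy(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

Whether capping registers (and eating the spills) or keeping them and accepting lower occupancy is the better deal is precisely the per-kernel tuning work being discussed here.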

GCN is probably easier to use, but since Teslas are outselling FirePros (to the best of my knowledge), the general feeling must be that NVIDIA's software is better. Whether things will remain that way is another story.
Looks like AMD did an awful lot right with GCN indeed. I think there's no question that Teslas are outselling FirePros (probably by an order of magnitude), but on the chip side of things this doesn't really seem justified.
 
Nope, it's not a golden sample; it's actually linked to the way boost behaves on Titan.

If I measure power in Anno after 30s I get 220W, but the cooling system (including the way it is calibrated) is not able to maintain 80°C with such a power draw (unless the system is in a fridge, of course). After 5 minutes power drops to 180W. If I add extra cooling around the board, power goes up to 200W.

So basically everything is tied far too much to the temperature. I don't understand why Nvidia didn't just set a temperature limit of 90°C.
 
I agree with you there. But ideally you'd want something that generally gives good performance without having to rewrite things manually too much. AMD's pre-SI chips could get very good performance for the right workloads too, but were (rightly) criticized because performance for other workloads was just bad.
And as far as I know, none of these OpenCL/DirectCompute benchmarks do anything particularly stupid, so seeing Titan fall behind the "little" 7970 GE is quite disappointing, given it has a sizable raw power advantage (granted it's not _that_ big with SP).

I think there's been a strategic shift on NVIDIA's side, and to an extent, on AMD's as well.

Sure, NVIDIA hopes that in time, automated tools will make it easier to tap into their GPUs' full potential, but in the meantime, it requires some extra effort. This is typically the sort of effort that HPC people are willing to put in, while developers of consumer applications might not be. Given general industry trends, this all makes sense. NVIDIA doesn't have any CPU technology, their overall graphics market share in PCs is small and shrinking, and they're just never going to get CUDA to be relevant enough for developers to target it. They do, however, have the HPC-GPU market cornered (for now, anyway), so it makes sense to double down on it by maximizing FLOPS. Besides, they keep working on software and introducing features like Hyper-Q and Dynamic Parallelism, so it's not necessarily a net loss in ease of programming, and potentially a big gain in performance.

If you've been paying attention to NVIDIA's communication, you may have noticed that every NVIDIA marketing guy was constantly going on and on about CUDA in the GT200 days, but they hardly ever mention it now when talking about GeForces. They've given up trying to convince us that it matters to consumers or that it ever will. They haven't even bothered making Tegra CUDA-capable. To an extent, you might make the same argument for PhysX. The latter never looked very alive, but with GCN in the PS4 and Xbox 720, it sure as hell is quite dead now.

AMD, on the other hand, is less focused on HPC, and much more on parallel, or rather heterogeneous computing for consumers: GCN is meant to go everywhere, in tablets with Temash, in notebooks with Kabini, Kaveri and discrete, in consoles (PS4 & Xbox 720), in desktops with Kaveri and discrete, and of course in servers too (FirePros, in APU or discrete variants, maybe even ARM stuff at some point). Because it targets such a wide audience, it has to be easy to program, and it has to work well because it's one of AMD's very few advantages compared to Intel. It's also why AMD is pushing HSA so hard.
 
Nope, it's not a golden sample; it's actually linked to the way boost behaves on Titan.

If I measure power in Anno after 30s I get 220W, but the cooling system (including the way it is calibrated) is not able to maintain 80°C with such a power draw (unless the system is in a fridge, of course). After 5 minutes power drops to 180W. If I add extra cooling around the board, power goes up to 200W.

So not only can the fps be higher (the max we've seen is 15% so far), but it can also throw up reported power differences of >50W. Dear oh dear. This is a tool for increasing benchmark scores, and while many in the press should be congratulated for their stances and commentary (PCGH, Hardware.fr and even Anandtech mentioned it, of the ones I've read), I don't think it's really gone far enough.
 
I think they manufacture as many as they WANT to sell. At this price point.

Have you seen this:

[attached chart from the TPU review linked below]

http://www.techpowerup.com/reviews/NVIDIA/GeForce_GTX_Titan/7.html

Looks like an error in the TPU chart. See this data point for comparison: http://images.hardwarecanucks.com/image//skymtl/GPU/GTX-TITAN/GTX-TITAN-64.jpg

Hardware Canucks appears to have an error or oddity in their Max Payne 3 data too. Notice that the 7970 GHz edition fps actually goes up when moving from 2560x1600 to 5760x1080: http://www.hardwarecanucks.com/foru...orce-gtx-titan-6gb-performance-review-14.html
 
Seriously? A driver that craps out at lower resolutions but then comes alive and outpaces itself as the demand increases? A type-A personality graphics card?

LOL.

Don't you remember the X1800-related MSAA bug in the OpenGL driver? MSAA 6× was significantly faster than MSAA 4×.
 
They haven't even bothered making Tegra CUDA-capable. To an extent, you might make the same argument for PhysX. The latter never looked very alive, but with GCN in the PS4 and Xbox 720, it sure as hell is quite dead now.

Lack of CUDA support in Tegra up to this point is likely due to the non-unified shader graphics architecture used, rather than a lack of interest from NVIDIA in promoting CUDA. NVIDIA's CEO did hint at CES that Tegra 5 will have CUDA support. And more and more universities are teaching CUDA. As for PhysX and game developer support, it appears that NVIDIA is focusing more and more on free-to-play games. For instance, the free-to-play Hawken and free-to-play Planetside 2 both have PhysX support.
 
Don't you remember the X1800-related MSAA bug in the OpenGL driver? MSAA 6× was significantly faster than MSAA 4×.

Plot the results from that review. Every card follows the expected 1/fillrate drop-off right up until the magical maximum resolution, where the 7970s jump UP instead of down.

How does a "bug" affect all but the highest resolution again?
 
Lack of CUDA support in Tegra up to this point is likely due to the non-unified shader graphics architecture used, rather than a lack of interest from NVIDIA in promoting CUDA. NVIDIA's CEO did hint at CES that Tegra 5 will have CUDA support. And more and more universities are teaching CUDA. As for PhysX and game developer support, it appears that NVIDIA is focusing more and more on free-to-play games. For instance, the free-to-play Hawken has PhysX support.

I know that Tegra lacks unified shaders, but that's the point. If NVIDIA thought GPGPU mattered to consumers, they'd give Tegra a unified shader architecture and make it compatible with CUDA and OpenCL, like Adreno, PowerVR and Mali.

Well, obviously those aren't CUDA-capable, but they do support OpenCL.

Instead we're in this strange situation where the company that initially turned GPGPU into something real is the only major mobile GPU maker that doesn't support any form of mobile GPGPU.
 
Besides, it's just an upper bound; in practice you're often better off trying to increase IPC per work-item rather than increasing the number of work-items (it's not as straightforward, but yields better results). See Vasily Volkov's work, e.g. this: Better Performance at Lower Occupancy. And doing this is not easy if you're short on registers.

Yes I know that paper well and it actually supports what I'm saying. You have to strike a fine balance between register usage and number of inflight threads. However, after a certain point peak flops is no longer limited by register availability because registers are obviously reused during execution.

Avoiding divergence and coalescing memory accesses also play a big part. That's why it's hard to just look at a bar graph and understand why a card performs at a certain level.

With respect to hand-tuning code for maximum efficiency I'm pretty sure you have to do the same thing to get the most out of SSE and AVX on CPUs. Auto-vectorizing compilers can only do so much with code that wasn't written with SIMD in mind.
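To put the coalescing point in code, here's a toy sketch of my own (not something from the thread): both kernels do trivial work, but in the first one a warp's 32 loads hit consecutive addresses and can be served by a couple of wide transactions, while the strided variant scatters them across memory and throughput collapses on any GPU, whatever its raw FLOPS.

Code:
// Hypothetical toy kernels contrasting coalesced and strided access patterns.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread k touches element k
    if (i < n)
        out[i] = in[i];
}

__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    // Neighbouring threads now touch elements `stride` apart, so a warp's
    // 32 accesses land in many different memory segments instead of one or two.
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}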
 
Hehe, the card is obviously built to feel best in Canada, Northern Europe or some parts of Russia. :LOL:

Seriously though, it's a nice mechanism for naturally extracting some extra performance. But obviously it won't be as pleasant if you keep the stock cooler and live in a warm environment.

Yup, it depends on the game: some don't stress the chip that much, while others can really heat it up.
 
I know that Tegra lacks unified shaders, but that's the point. If NVIDIA thought GPGPU mattered to consumers, they'd give Tegra a unified shader architecture and make it compatible with CUDA and OpenCL, like Adreno, PowerVR and Mali.

With respect to Tegra 4, it was a matter of prioritization. NVIDIA clearly felt it was more important to maximize performance and minimize cost within a target die area than to add unified shaders and CUDA support. Anyway, it will be coming next year, so better late than never.
 
Lack of CUDA support in Tegra up to this point is likely due to the non-unified shader graphics architecture used, rather than a lack of interest from NVIDIA in promoting CUDA. NVIDIA's CEO did hint at CES that Tegra 5 will have CUDA support. And more and more universities are teaching CUDA. As for PhysX and game developer support, it appears that NVIDIA is focusing more and more on free-to-play games. For instance, the free-to-play Hawken and free-to-play Planetside 2 both have PhysX support.

Hmm, let's be honest too: most mobile software and OSes (including internet browsers and video encoding/decoding) just use OpenCL. There isn't that much power available, and little interest in it beyond maybe applying an HDR filter to the camera. Adding CUDA to their SoC GPU (which doesn't have CUDA cores or unified shaders) doesn't seem like a good idea to me, nor does it make much sense in today's mobile ecosystem (Android). That may change in the near future, but right now I don't see any reason for Nvidia to do it.

As for Hawken and PlanetSide 2 (PhysX has actually been removed for now, pending a patch), Hawken is still more of a beta than anything else. I don't think they specifically want to support or focus on free-to-play games; those are just the only titles they hadn't advertised yet (the last was BL2, and even there most people run PhysX on medium with Nvidia cards, since there's still work to do, with major fps drops in some places for "no reason").
 
Yes I know that paper well and it actually supports what I'm saying. You have to strike a fine balance between register usage and number of inflight threads. However, after a certain point peak flops is no longer limited by register availability because registers are obviously reused during execution.

Yes, but if peak flops is no longer limited by register availability per thread, then it's probably limited by thread count, which means you have to increase occupancy.

But if you do, you're going to see a linear increase in the necessary register count. The problem is that compared to a GF100 SM, a GK104 SMX has 6 times the FLOP/cycle rate (192 vs. 32 FP32 lanes) but only twice the registers (64K vs. 32K 32-bit entries).

Say you have a piece of CUDA code developed for Fermi, and you want to port it to GK104 (or GK110, for that matter).

If you can increase ILP by a good bit, and then maybe occupancy, you'll be fine, you should have enough registers.

If you can't increase ILP, but you were only really using 1/3 of your registers on Fermi, that's fine, multiply your thread count by 6, you'll fill your register file but it's OK, you'll get the same efficiency.

If you can't increase ILP and you're already using more than 1/3 of your registers on Fermi, then by all means increase the thread count, it's all you can do and it will help, but you're going to see some register spilling into the cache and you won't get the same efficiency. The problem is that this is probably what most coders tend to do, or at least the way that applications originally developed for Fermi will behave on Kepler.
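As a hypothetical sketch of the "increase ILP instead of thread count" option (my own illustration, not code from the thread): each thread below handles four independent elements, so the scheduler gets independent instructions to overlap while the total thread count, and with it the aggregate register demand, stays where it was, at the cost of a few extra registers per thread.

Code:
// Hypothetical ILP-oriented kernel: 4 independent, coalesced accesses per
// thread. Assumes the launch provides at least ceil(n / 4) threads in total.
__global__ void scale_ilp4(const float* __restrict__ in,
                           float* __restrict__ out,
                           float k, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    #pragma unroll
    for (int j = 0; j < 4; ++j) {
        int i = tid + j * stride;   // elements a full grid apart: loads stay coalesced
        if (i < n)
            out[i] = in[i] * k;
    }
}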

Avoiding divergence and coalescing memory accesses also play a big part. That's why it's hard to just look at a bar graph and understand why a card performs at a certain level.

With respect to hand-tuning code for maximum efficiency I'm pretty sure you have to do the same thing to get the most out of SSE and AVX on CPUs. Auto-vectorizing compilers can only do so much with code that wasn't written with SIMD in mind.

No argument there.
 
Wait, in summary:

CUDA compute capability 3.x devices have 4 warp schedulers per SM. This means that each SM can issue instructions from 4 warps per clock, and each of these schedulers can also issue 2 instructions at once.
This means, as you said, 256 MADs. But I don't get what those execution ports are...??

There's a crossbar between the 6 ALU blocks and the instruction schedulers, meaning that while each warp scheduler can in theory issue two instructions per cycle, only 6 can actually be issued to the ALUs in total.

This makes a lot of sense from an efficiency standpoint, since most of the time you won't have two independent instructions per warp anyway. By under-provisioning the ALUs you drop units that would mostly sit idle, which frees up die area for more cores and such overall.

Kepler has different throughputs for different instruction types, something like 6 types overall. It's possible that each of these instruction types has its own independent set of pipelines, which could in theory make it possible to issue all 8 instructions, provided that no more than 6 of them are FP, though there's no solid information on whether the architecture actually works like this.
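For what it's worth, here's the back-of-the-envelope issue math behind that, using the commonly quoted GK104 SMX figures, as a tiny host-side snippet. Treat it as an illustration of the argument above, not an official breakdown from NVIDIA.

Code:
#include <cstdio>

int main()
{
    const int warp_schedulers = 4;    // per SMX
    const int dual_issue      = 2;    // instructions per scheduler per clock
    const int fp32_cores      = 192;  // FP32 ALUs per SMX
    const int warp_width      = 32;

    int issue_slots = warp_schedulers * dual_issue;  // 8 warp-instructions issuable per clock
    int fp_blocks   = fp32_cores / warp_width;       // 6 warp-wide FP32 blocks available per clock

    printf("issue slots per clock: %d\n", issue_slots);
    printf("FP32 blocks per clock: %d\n", fp_blocks);
    printf("=> at most %d of the %d issued instructions can be FP32 math;\n"
           "   the remaining slots have to go to loads/stores, SFU work, etc.\n",
           fp_blocks, issue_slots);
    return 0;
}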
 
Nope, it's not a golden sample; it's actually linked to the way boost behaves on Titan.

If I measure power in Anno after 30s I get 220W, but the cooling system (including the way it is calibrated) is not able to maintain 80°C with such a power draw (unless the system is in a fridge, of course). After 5 minutes power drops to 180W. If I add extra cooling around the board, power goes up to 200W.

That's very interesting and unfortunate for anyone spending the money for this card.

5 minutes is probably enough for most benchmarking sites to get their numbers. But if it operates at that lower power level for the vast majority of a gaming session, then most review sites' benchmarks could be highly misleading with regard to the actual performance of the card, relative not only to the competition but to Nvidia's own cards.

I'm sure they put this in with the best of intentions, but it has the potential to be very misleading to anyone looking at reviews to get an idea of how this card performs.

Thanks for bringing this to light.

I didn't like boost 1.0 (don't like AMD following suit) and boost 2.0 just seems to be even more of a train wreck.

I'd much rather have a card with consistent performance, and if the user chooses, they can overclock if they wish.

Woo, my card runs super fast when I first start the game, but as soon as anything interesting happens temps go up and suddenly my card isn't quite so fast in games anymore. Bleh.

Regards,
SB
 
I didn't like boost 1.0 (don't like AMD following suit) and boost 2.0 just seems to be even more of a train wreck.

In my case boost 2.0 would give me higher clocks. My card doesn't go over 58 degrees and it's dead silent. Yet nVidia's drivers limit clocks and voltages based on TDP. Since I don't have a temperature or noise problem, I would much rather they allow higher clocks until temperatures are closer to 70/80 degrees.

Variability on 2.0 is certainly higher but you probably get higher clocks overall.
 
I didn't like boost 1.0 (don't like AMD following suit) and boost 2.0 just seems to be even more of a train wreck.

AMD's Turbo is deterministic, it's not temperature-dependent, and it doesn't vary from card to card: for a given workload, you always get the same performance for every card that isn't in an oven.

It may have other drawbacks, but not those.
 