NVIDIA Maxwell Speculation Thread

DX12 should work on all Kepler and Maxwell chips, just not with all the fancy features, but with all the CPU-side performance improvements.
 

All the "CPU side performance imprevements" work on Fermi & up, GCN and up and Intel 7th or 7.5th gen and up (can't remember exact Intel gen for sure)
 
Much of GM204's superiority comes from its high clocks, on top of its increased IPC.

Nvidia increased IPC (this I get from Ryan's explanation at AnandTech) and raised the clock by 25% (how did they manage this?), and somehow power consumption didn't increase.

IPC is about the same as the previous gen; clocks were increased by 25% and 32% for the 980 and 970 respectively. Using the former as an example, with 2048 EUs you'll get the equivalent of 2048 x 1.25 = 2560 EUs at reference frequency, and likely with better scaling than increasing the EU count.
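A quick back-of-the-envelope sketch of that arithmetic in Python (numbers taken from the post above; assumes throughput scales perfectly with clock):

```python
# Back-of-the-envelope version of the arithmetic above: a higher clock scales
# effective throughput the same way more units would (assuming perfect scaling).
gm204_alus = 2048      # GTX 980 "EU"/ALU count, from the post
clock_uplift = 1.25    # ~25% higher clock than the previous-gen reference

equivalent_alus = gm204_alus * clock_uplift
print(equivalent_alus)  # 2560.0 -> roughly "2560 EUs at reference frequency"
```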
 

You have to remember that Kepler's ALUs can only be fully utilized when the code sustains an ALU IPC above 1.5 per scheduler (the 4 schedulers can dual issue, but there are only 6 ALU blocks behind them). This means that the 2560 case represents a best case for Kepler, where the workload has lots of available ILP.

In the real world, Kepler does worse, since a lot of the time there's no other instruction that can be dual issued. This is why GTX 980 outperforms GTX 780 Ti much of the time (and sometimes significantly), even though the 780 Ti has 2880 ALUs.

In worst-case Kepler code, we would have 2048 Maxwell ALUs * 1.25 clock * 1.5 from the IPC Kepler can't extract = the equivalent of 3840 Kepler ALUs. This is closer to what we see in some compute benchmarks.

(Note: This is disregarding the memory system and assuming compute bound code. With memory bound... I have no idea. Kepler has more bandwidth, while Maxwell has a bigger cache. Depends on the workload!)
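Putting the post's best-case and worst-case numbers into one small sketch (illustrative only: compute-bound code, memory system ignored):

```python
# Rough best-case / worst-case comparison sketched in the post above.
# Purely illustrative: compute-bound code only, memory system ignored.
maxwell_alus = 2048        # GTX 980
clock_uplift = 1.25        # ~25% clock advantage
kepler_ilp_factor = 1.5    # ALU IPC Kepler needs per scheduler to fill all 6 ALU blocks

# Best case for Kepler (plenty of ILP, dual issue succeeds): Maxwell only wins on clocks.
best_case_kepler_equiv = maxwell_alus * clock_uplift                       # 2560
# Worst case for Kepler (no dual issue possible): Maxwell also recovers the 1.5x IPC gap.
worst_case_kepler_equiv = maxwell_alus * clock_uplift * kepler_ilp_factor  # 3840

print(best_case_kepler_equiv, worst_case_kepler_equiv)  # 2560.0 3840.0
```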
 
You have to remember that Kepler's ALUs can only be fully utilized when the code sustains an ALU IPC above 1.5 per scheduler (the 4 schedulers can dual issue, but there are only 6 ALU blocks behind them).
Kepler was actually limited by its operand collector, which can only fetch 3 registers per clock. That's OK if you're dual issuing, say, an add and a store or SFU op. Or, if you're doing two-argument ALU ops like adds, the collector can amortize the "spare" register read and dual issue every other clock, keeping all 6 ALU blocks busy. Kepler's SMX design tries to maximize the chance of dual issue, at the expense of potentially idle ALUs.

But it leads to inefficiency if you're executing a 3-input FMA, which needs all three register reads and leaves no chance for any dual issue. (And as you note, that case leaves 1/3 of your ALUs idle.) So the IPC becomes crucially dependent on the code having a low FMA density, and in practice that density may be pretty high for both graphics and GPGPU.

Maxwell simplifies this. There's only one ALU per scheduler, so it can either dual issue an ALU op and a store/SFU, OR it can do an FMA. In both cases all the ALUs are always occupied. So it forgoes some dual-issue chances, but keeps all its ALUs busy.

In hindsight, Maxwell's "use all the ALUs efficiently" approach was a better design than Kepler's "use all the schedulers efficiently". It probably was not so obvious back in 2009 or so when Kepler was designed.
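A toy model of that scheduling trade-off, under the simplifications described above (4 schedulers and 6 ALU blocks per Kepler SMX, a 3-read operand collector, one ALU block per Maxwell scheduler; the "FMA density" knob is hypothetical):

```python
# Toy model of the trade-off described above (numbers simplified/hypothetical):
# a Kepler SMX has 4 schedulers feeding 6 ALU blocks, and the 3-reads-per-clock
# operand collector means a 3-input FMA cannot be dual issued, while cheaper
# 2-input ops can be paired. Maxwell gives each scheduler its own ALU block.
def kepler_alu_utilization(fma_density):
    schedulers, alu_blocks = 4, 6
    # FMAs issue alone (1 ALU op/scheduler/clock); other ops can be paired (2/clock).
    alu_issues = schedulers * (1 * fma_density + 2 * (1 - fma_density))
    return min(alu_issues, alu_blocks) / alu_blocks

def maxwell_alu_utilization(fma_density):
    # One ALU block per scheduler: it stays busy whether the op is an FMA or not.
    return 1.0

for density in (0.0, 0.5, 1.0):
    print(f"FMA density {density:.1f}: "
          f"Kepler {kepler_alu_utilization(density):.2f}, "
          f"Maxwell {maxwell_alu_utilization(density):.2f}")
# Pure-FMA code leaves 1/3 of Kepler's ALU blocks idle (0.67), matching the post.
```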
 
8GB GTX 980 looming... latest rumour.

Sales of graphics cards based on NVIDIA's new "GeForce GTX 980/970" GPUs are strong. A "GeForce GTX 980" with doubled video memory, an 8GB GDDR5 version, is scheduled to launch in November or December, and preparations at each board partner appear to be going well.
In addition, graphics cards based on the "GeForce GTX 960" are scheduled to appear in Q1 2015. Details are undecided for the moment, but "there may be a last-minute release in December".
http://www.gdm.or.jp/voices/2014/1029/90795
 
Agreed. I think the 8GB model is targeting gamers with 4K monitors, but it should also be useful for gamers using DSR to render at 4K and downsample to their native resolution.
 
I haven't seen a recent game hit even 3GB VRAM utilization even with DSR 2720x1536 or MSAA/SSAA. 4GB seems the ideal amount to pay for with current GPU performance.
 
I think that 8GB in a GTX 980 is useless.
An 8GB GTX 980 will make many GPGPU developers very happy!


Though that would be overshadowed by the chance that the Tesla M20 will be announced, or perhaps even released, on November 16, the first day of Supercomputing 2014. The K20 launched at SC12, though it was announced 6 months earlier.
 
I haven't seen a recent game hit even 3GB VRAM utilization even with DSR 2720x1536 or MSAA/SSAA. 4GB seems the ideal amount to pay for with current GPU performance.

Playing at 5920x1080, I have many games that completely use up all 4 GB of my R290.
The last ones I've been playing that do this were Star Point Gemini, Skyrim (yeah, texture mods) and Star Citizen, but there are many more.
 
I haven't seen a recent game hit even 3GB VRAM utilization even with DSR 2720x1536 or MSAA/SSAA. 4GB seems the ideal amount to pay for with current GPU performance.

Wolfenstein uses 3 GB of VRAM (max detail @ 1920x1080).
Ryse uses even more, 3.4 GB of VRAM (max detail without SSAA @ 1920x1080).
 
I've seen over 6 (before a patch even over 8) Gigabytes of VRAM used in Watch Dogs - Ultra-HD, Ultra-Textures and 8x MSAA. :)
 
Edit: Woops, I was assuming that the new color compression methods improved memory utilization efficiency and not just bandwidth. That wouldn't be the case?
 
It wouldn't be the case if you still need random access to the frame buffer. Which, I think, you still need.
 
I've seen over 6 (before a patch even over 8) Gigabytes of VRAM used in Watch Dogs - Ultra-HD, Ultra-Textures and 8x MSAA.
Maybe that's why it stutters so much.

MSAA is so useless these days. I experimented with all of the options in Watch Dogs the other day and none of them are particularly impressive. I think TXAA is perhaps the most effective though. Sometimes I feel like just not using AA at all instead of all the halfway effective options.
 
Edit: Woops, I was assuming that the new color compression methods improved memory utilization efficiency and not just bandwidth. That wouldn't be the case?

It could make it somewhat worse. Lossless compression will always have data it cannot compress, and then it must store at least some extra data saying it could not be compressed.
Due to fluctuating compression rates, the safest course would seem to be allocating for worst-case consumption, rather than finding out at a bad time that there's no room for the overflowing framebuffer.
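A minimal sketch of that point (tile and payload sizes are made up): the bandwidth can shrink per tile, but the allocation still has to cover the incompressible worst case.

```python
# Minimal sketch, with made-up tile/payload sizes: lossless framebuffer compression
# can cut bandwidth per tile, but since any tile may turn out incompressible,
# the memory allocation still has to cover the uncompressed worst case.
TILE_BYTES = 256  # hypothetical uncompressed tile size

def framebuffer_allocation(num_tiles):
    # Reserved up front at worst case, no matter how well tiles end up compressing.
    return num_tiles * TILE_BYTES

def tile_traffic(compressed_bytes):
    # Traffic is saved only when the tile actually compresses; otherwise it's stored raw.
    return min(compressed_bytes, TILE_BYTES)

print(framebuffer_allocation(1024))         # 262144 bytes always reserved
print(tile_traffic(64), tile_traffic(300))  # 64 (compressed), 256 (stored raw)
```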
 
You don't have to store data in the uncompressed case. The decompressor could look for block compression headers. If they're not there, pass the block on unchanged because it's uncompressed.
 
What if the uncompressed pixels happen to have the same value as the headers?

You can't escape the pigeonhole principle...
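A tiny illustration of that collision, with a hypothetical one-byte marker: an in-band "compressed" header can be forged by ordinary pixel data, which is why per-block compression state would need to live in separate metadata.

```python
# Why an in-band "compressed" marker can't be trusted: raw pixel data can start
# with the same bytes as the marker (pigeonhole), so the decoder misclassifies it.
# Keeping a per-block flag out of band removes the ambiguity. The marker is hypothetical.
MAGIC = b"\xC0"  # pretend 1-byte "this block is compressed" header

def decode_in_band(block):
    if block.startswith(MAGIC):
        return ("compressed", block[len(MAGIC):])  # wrong if raw data merely starts with 0xC0
    return ("raw", block)

def decode_out_of_band(block, is_compressed):
    # One metadata bit per block, stored separately: no ambiguity possible.
    return ("compressed", block) if is_compressed else ("raw", block)

raw_block = b"\xC0\x11\x22\x33"              # uncompressed pixels colliding with MAGIC
print(decode_in_band(raw_block))             # -> ('compressed', ...)  misclassified
print(decode_out_of_band(raw_block, False))  # -> ('raw', ...)         correct
```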
 