NVIDIA Maxwell Speculation Thread

Output of 32-bit DeviceQuery:

deviceQuery.exe Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 750"
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Clock rate: 1137 MHz (1.14 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GeForce GTX 750
Result = PASS
Source
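
For anyone who wants to sanity-check the headline figures, here is a rough sketch that derives peak single-precision throughput and memory bandwidth from the deviceQuery output above. Two things are assumptions on my part and not printed by deviceQuery: that each CUDA core retires one FMA (2 FLOPs) per clock, and that the reported memory clock transfers data on both edges.

```python
# Figures copied from the deviceQuery output above (GeForce GTX 750).
multiprocessors = 4
cores_per_mp = 128
gpu_clock_hz = 1137e6
mem_clock_hz = 2505e6
bus_width_bits = 128

# Assumption: one fused multiply-add per core per clock, counted as 2 FLOPs.
FLOPS_PER_CORE_PER_CLOCK = 2
# Assumption: the reported memory clock moves data on both clock edges.
TRANSFERS_PER_CLOCK = 2

cuda_cores = multiprocessors * cores_per_mp
peak_sp_gflops = cuda_cores * FLOPS_PER_CORE_PER_CLOCK * gpu_clock_hz / 1e9
peak_bandwidth_gbs = mem_clock_hz * TRANSFERS_PER_CLOCK * (bus_width_bits / 8) / 1e9

print(f"{cuda_cores} cores, {peak_sp_gflops:.0f} GFLOPS SP, {peak_bandwidth_gbs:.1f} GB/s")
```

These are theoretical peaks only; real kernels will land below both numbers.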

Why is the shared memory still reported as 48K?
 
Shared memory per block is limited to a smaller value than the total shared memory capacity per SM.
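
To illustrate the difference between the two limits, here is a minimal sketch of how many blocks could be resident on one SM when only shared memory is considered. The 48 KiB per-block limit is the value deviceQuery reports; the 64 KiB per-SM capacity is my assumption for GM107 and is not shown in the output above. The function name is made up for this example, and it ignores allocation granularity and every other occupancy limit (registers, threads, block slots).

```python
KIB = 1024

def blocks_per_sm_by_shared_memory(smem_per_block,
                                   smem_per_sm=64 * KIB,      # assumed GM107 capacity per SM
                                   per_block_limit=48 * KIB): # from deviceQuery (49152 bytes)
    """Upper bound on resident blocks per SM, looking at shared memory only."""
    if smem_per_block > per_block_limit:
        raise ValueError("request exceeds the per-block shared memory limit")
    if smem_per_block == 0:
        return None  # shared memory does not limit residency at all
    return smem_per_sm // smem_per_block

for request_kib in (48, 32, 16):
    print(request_kib, "KiB per block ->",
          blocks_per_sm_by_shared_memory(request_kib * KIB), "block(s) per SM")
```

Under these assumptions, a block that takes the full 48 KiB leaves the rest of the SM's shared memory unused, while smaller requests let several blocks share one SM.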
 
Nice catch, I wonder what happened there. Do you know of similar figures for GCN chips? Or older GPUs, for that matter?
I have no idea: it's the first time I've seen these latency graphs for a GPU. Very surprising to see them on TomsHardware of all places.
 
As I said, the GM107 chip seems great. It's the MSRP of these cards that makes them mediocre. As always, there are no bad products, only bad prices.


But please, feel free to explain how this logic is "elitist".

Sorry. I should have come back and corrected my post. True, it's overpriced, which means having to wait six months or a year. Nvidia pulled the same thing with the GTX 650.
Nvidia does that all the time: it introduces new GPUs at a high price, then the price comes down when AMD can compete or when the former Nvidia generation is EOL. AMD introduces stuff at more affordable prices from the start.
 
You've either said far too much or far too little. Far too little for my taste. :D How is the result bogus? How will it be addressed? Why the quotation marks?

The Sandra latency benchmarks have been questionable before.
Their L1 numbers for AMD's VLIW GPUs are about double the latencies measured in arithmetic loops, for example. The measured numbers were still very high, just not that high.
I haven't come across a reason why that was the case, or why this pattern of overestimation persists with the 270.
 
Can I ask you, does GM107 have dedicated DP cores? Because here I read this:

We're not sure what to think about GM107's increasingly hobbled FP64 capabilities. You can either say double-precision performance is really bad, or the single-precision numbers are really good. Regardless, at the end of the day, artificial limitations meant to prevent cheap desktop cards from being viable workstation parts are no less irritating.

http://www.tomshardware.com/reviews/geforce-gtx-750-ti-review,3750-16.html
 
From what I've understood so far there are 4 FP64 SPs in each SMM. With 5 clusters you'd have 20 FP64 SPs for the GM107. It's not much more compared to the 16 SPs in GK107 but it isn't less either.

And I'm obviously missing something in that Tom's link, since the DP results they're showing are anything but bad for such a humble GPU, especially if I consider how damn expensive a Quadro K5000 is and where that one lands compared to a 750 Ti. If the entire thing is based on the 1:24 vs. 1:32 thingy then it's rather nonsense. With 2.5x more clusters than a GK107 it isn't necessarily an absurd design decision, always of course under NV's typical reasoning for those kinds of things.
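
The 1:24 vs. 1:32 figures fall straight out of the unit counts. A quick sketch, taking the thread's numbers (4 FP64 units per SMM, 5 SMMs, 16 FP64 units in GK107) and assuming 128 FP32 cores per SMM, 2 SMX in GK107 (implied by "2.5x more clusters") and 192 FP32 cores per SMX, which is not stated above:

```python
def unit_counts(clusters, sp_per_cluster, dp_per_cluster):
    """Return (FP32 units, FP64 units, FP32 units per FP64 unit) for a chip."""
    sp = clusters * sp_per_cluster
    dp = clusters * dp_per_cluster
    return sp, dp, sp / dp

# GM107: 5 SMMs, 128 FP32 + 4 FP64 units each.
gm107 = unit_counts(clusters=5, sp_per_cluster=128, dp_per_cluster=4)
# GK107: 2 SMX, 192 FP32 (assumed) + 8 FP64 units each (16 total, per the post).
gk107 = unit_counts(clusters=2, sp_per_cluster=192, dp_per_cluster=8)

for name, (sp, dp, ratio) in (("GM107", gm107), ("GK107", gk107)):
    print(f"{name}: {sp} FP32, {dp} FP64, DP:SP = 1:{ratio:.0f}")
```

So the ratio gets worse mostly because the FP32 count grew, while the absolute number of FP64 units went up slightly.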
 
From the wording of Tomshardware ('artificially hobbled') it's almost as if they believe that gm107 has more DP units than reported.
 
You have to keep in mind that this test doesn't really depend on DP rate all that much. If you compare Titan to the 780 Ti, for instance, the former is faster, yes, but merely by 40% or so.
That said, you'd think at least all the low-DP-performance chips (that is, those with 1:16 to 1:32 DP:SP performance) would rank somewhere close to their DP capability (since you'd think with such low DP performance other bottlenecks would disappear), but that's not the case either. The 750 Ti is indeed still very close to the HD 7790 even though the latter has more than twice the DP flops (though the 7790 does make up some ground there compared to the single-precision results, where the 750 Ti beats it).
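
For reference, a back-of-the-envelope check of the "more than twice the DP flops" remark. The core counts, reference clocks and DP rates below are not from this thread; they are the public reference specs as I remember them, so treat them as assumptions.

```python
def peak_gflops(cores, clock_mhz, dp_rate=1.0):
    """Peak GFLOPS assuming one FMA (2 FLOPs) per core per clock, scaled by dp_rate."""
    return cores * 2 * clock_mhz / 1000 * dp_rate

# Assumed reference specs: GTX 750 Ti = 640 cores @ 1020 MHz, DP at 1/32 rate.
gtx_750_ti_dp = peak_gflops(640, 1020, dp_rate=1 / 32)
# Assumed reference specs: HD 7790 = 896 stream processors @ 1000 MHz, DP at 1/16 rate.
hd_7790_dp = peak_gflops(896, 1000, dp_rate=1 / 16)

print(f"750 Ti DP: {gtx_750_ti_dp:.1f} GFLOPS")
print(f"HD 7790 DP: {hd_7790_dp:.1f} GFLOPS")
print(f"ratio: {hd_7790_dp / gtx_750_ti_dp:.2f}x")
```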

I think for technical analysis you shouldn't trust Tomshardware :).
 

Ok, thank you very much.
 
1. Nvidia confirmed to me it is 12 KiB per 4 TMU blocks. I actually asked them if that wouldn't be a perf issue for some compute task but they feel it shouldn't be an issue for THIS particular GPU. Plus don't forget that register pressure will be lowered a little bit thanks to the shorter arithmetic pipeline.

rpg.314's point about L1 misses mainly going to L2 also makes sense, and I also didn't know that L1 mainly cached stacks. Sounds like maybe things might be different for later chips.

2. At first I actually placed one single DP unit in each partition, thinking it would be the obvious path to expand DP rate for big Maxwell. However Nvidia corrected me and said those DP units were outside the partition.

Well, that settles that then. Thanks for the clarifications!
 

Comparing the Tom's Hardware latency graph with Haswell numbers here, it looks like Nvidia's 24KB L1 latency is about 3x Haswell's 6MB L3 latency, or comparable to Intel's 128MB off-chip L4 latency (for in-page random loads, which apparently means an access pattern that avoids TLB misses). And that's comparing just cycle counts; Haswell's cycle time is less than half that of the GM107.

I'm wondering if Sandra might be launching a bunch of warps/wavefronts that all do memory access and end up getting scheduled in round robin fashion, so that what Sandra really measures is something more like the size of the scheduler's queue of warps/wavefronts rather than actual cache latency.
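
To put the cycle-count comparison above into wall-clock terms, here is a small sketch. The "about 3x" cycle ratio is the estimate from this post. The GPU clock is the 1137 MHz from the GTX 750 deviceQuery output at the top of the thread, used as a stand-in for GM107 in general, and the 3.5 GHz Haswell clock is purely an assumption for illustration.

```python
gm107_clock_hz = 1137e6    # GTX 750 clock from the deviceQuery output
haswell_clock_hz = 3.5e9   # assumed Haswell core clock, illustration only
cycle_count_ratio = 3.0    # GM107 L1 cycles / Haswell L3 cycles ("about 3x")

gm107_cycle_ns = 1e9 / gm107_clock_hz
haswell_cycle_ns = 1e9 / haswell_clock_hz

# Wall-clock latency ratio = cycle-count ratio * cycle-time ratio.
wall_clock_ratio = cycle_count_ratio * gm107_cycle_ns / haswell_cycle_ns

print(f"GM107 cycle: {gm107_cycle_ns:.3f} ns, Haswell cycle: {haswell_cycle_ns:.3f} ns")
print(f"wall-clock latency ratio: {wall_clock_ratio:.1f}x")
```

With these assumed clocks the gap in nanoseconds is considerably larger than the gap in cycles, which is part of why the measured numbers look suspicious.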
 


To be honest, I have some real doubts about the SP (and DP) compute benchmark numbers we have seen so far for Maxwell (well, the 750TI). I have a hard time seeing how a 750TI, with half the CUDA cores of the GTX680 and a similar clock speed, can get the same numbers in the F@H SP test, especially when this does not translate into any graphics benchmark. We are talking about OpenCL here, not CUDA, so the CUDA version is not the cause. So is it drivers? Software? A bug? In reality, the compute benchmarks I have seen are absolutely all over the place: from expected SP performance (given that the GPU sits between a 750TI and 750TI boost) to the extremely surprising F@H SP results (close to the 680).
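
To put a number on that doubt, here is a rough sketch of the theoretical gap. The core counts and base clocks are public reference specs as I recall them, not figures from this thread, so treat them as assumptions; the point is only how much more work per core per clock the 750 Ti would need to match the 680.

```python
def peak_sp_gflops(cores, clock_mhz):
    """Peak SP GFLOPS assuming one FMA (2 FLOPs) per core per clock."""
    return cores * 2 * clock_mhz / 1000

# Assumed reference specs: GTX 750 Ti = 640 cores @ 1020 MHz base.
gtx_750_ti = peak_sp_gflops(640, 1020)
# Assumed reference specs: GTX 680 = 1536 cores @ 1006 MHz base.
gtx_680 = peak_sp_gflops(1536, 1006)

core_ratio = 640 / 1536
required_speedup = gtx_680 / gtx_750_ti

print(f"750 Ti: {gtx_750_ti:.0f} GFLOPS, 680: {gtx_680:.0f} GFLOPS")
print(f"core count ratio: {core_ratio:.2f}")
print(f"per-core efficiency gain needed to tie: {required_speedup:.2f}x")
```

Matching results would therefore imply that the benchmark runs far from peak on the 680, or that something other than raw ALU throughput decides the score.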
 

If NV keeps that pace and Maxwell cores below the GM200 have slightly more DP units than their predecessors, then GM204 might end up way more powerful than I imagined so far.
 

Not on GM204. I'm 99.99% sure that they will cripple DP on it again and disable DP units.
 