NVIDIA Maxwell Speculation Thread

Tridam's excellent GTX 750 Ti review is finally ready: http://www.hardware.fr/articles/916-1/nvidia-geforce-gtx-750-ti-gtx-750-maxwell-fait-ses-debuts.html
Interestingly, the fillrate test there does not mirror the uber-high bandwidth efficiency seen in the 3DMark fillrate test (as seen by AnandTech). (Though Tridam's conclusions there are wrong, as he wrongly assumed the number of ROPs was doubled. In particular, FP32 blending is definitely just very slow, completely ROP-bound and not bandwidth-limited.)

Thanks, I fixed that! Of course the GM107 has 16 ROPs as well, but it benefits from the bigger rasterizer and 4-5 SMMs able to deliver 16-20 4-byte pixels per clock. And you're correct that FP32 throughput is not limited by memory bandwidth; I just double-checked that.

I'm using my own test. It shows the best case when it comes to data compression opportunities.
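
To make the pixel-output argument above concrete, here is a rough back-of-the-envelope sketch in Python. The clock, ROP count, SMM count and bandwidth figures are assumptions (roughly GTX 750 Ti reference values), not numbers taken from the review or this thread:

# Back-of-the-envelope RGBA8 fillrate sketch for a GM107-class part.
# All figures below are assumed reference values, not measurements.
core_clock_ghz = 1.02    # assumed base clock
rops           = 16      # 4-byte pixels per clock in total
smms           = 5       # each SMM assumed to feed 4 pixels/clock toward the ROPs
bandwidth_gbs  = 86.4    # 128-bit GDDR5 @ 5.4 Gbps
# Pixel output is capped by whichever is smaller: ROP throughput or what the
# rasterizer/SMMs can deliver (the 16-20 pixels/clock mentioned above).
pixels_per_clock = min(rops, smms * 4)
fillrate_gpix = pixels_per_clock * core_clock_ghz
print(f"theoretical RGBA8 fillrate: {fillrate_gpix:.1f} Gpix/s")
# Bandwidth needed to sustain that rate, ignoring framebuffer compression.
write_only_gbs = fillrate_gpix * 4          # write-only
blend_gbs      = fillrate_gpix * 8          # read destination + write result
print(f"needed: {write_only_gbs:.0f} GB/s write-only, {blend_gbs:.0f} GB/s with blending"
      f" (vs. {bandwidth_gbs} GB/s available)")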
 
> Thanks, I fixed that! Of course the GM107 has 16 ROPs as well, but it benefits from the bigger rasterizer and 4-5 SMMs able to deliver 16-20 4-byte pixels per clock.
Ok. I still heavily disagree with the FP32 blend conclusion though :).
The bandwidth needed for 4xFP32 would be 4 times that of 4xint8. Ok, compression could change that, but the result is way, way lower. I believe Kepler/Maxwell can do FP32 blending only at 1/16 rate (but can use all resources for just one channel, hence 1/4 rate for single-channel FP32). This matches the actual numbers coming out of the test MUCH better than assuming it's bandwidth-limited...
> I'm using my own test. It shows the best case when it comes to data compression opportunities.
I'll assume there's no data locality, though (the 3DMark result seems to me to indicate it could take advantage of a large (ROP) cache).
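
The 1/16-rate vs. bandwidth-limited argument above can be put into numbers with a quick sketch. The figures are assumed GTX 750 Ti-class values (16 ROPs, ~1.02 GHz, 86.4 GB/s); the point is only how far apart the two predicted rates land, not the exact values:

# Two competing explanations for the 4xFP32 blend result.
rops, clock_ghz, bandwidth_gbs = 16, 1.02, 86.4   # assumed reference figures
bytes_per_pixel = 4 * 4                           # RGBA32F
blend_traffic   = 2 * bytes_per_pixel             # read destination + write result
# Hypothesis 1: purely bandwidth-limited (no compression, no cache reuse).
bw_limited_gpix = bandwidth_gbs / blend_traffic
# Hypothesis 2: ROPs run FP32 blending at 1/16 of their int8 rate.
rop_limited_gpix = rops * clock_ghz / 16
print(f"bandwidth-limited : {bw_limited_gpix:.2f} Gpix/s")
print(f"1/16-rate ROPs    : {rop_limited_gpix:.2f} Gpix/s")
# A measured figure sitting much closer to the second value is what suggests
# FP32 blending is ROP-bound rather than bandwidth-bound.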
 
Looks like I answered this second part at the same time as you were posting, hehe.

I agree that everything points to the FP32 blending throughput being limited to 1/16. It's actually what I wrote previously.
 
> Looks like I answered this second part at the same time as you were posting, hehe.

> I agree that everything points to the FP32 blending throughput being limited to 1/16. It's actually what I wrote previously.
Ok, I agree then :).
GM107 definitely seems to make good use of its available resources. Now I realize its complexity (transistor count) is quite close to Bonaire (and performance is quite close too, though with 20% less memory bandwidth to boot), but all the raw numbers (peak FP rate, TMUs, rasterization) are nearly identical to Cape Verde instead.
 
Looking at the Folding@home benchmark (which employs an embarrassingly parallel method), the computing efficiency of Maxwell has improved quite significantly.

Faster than GF100 despite roughly comparable GFLOPS and half the memory bandwidth:
http://www.anandtech.com/bench/product/1135?vs=1130

Roughly 1/2 of Titan's performance:
http://www.anandtech.com/bench/product/1060?vs=1130

3/4 the computing power of the AMD 290X here:
http://www.anandtech.com/bench/product/1056?vs=1130

Very efficient arch; it would save people a lot of time on optimization.
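
As a rough illustration of the efficiency point, the sketch below normalises the relative Folding@home results quoted above by assumed peak-FLOPS figures. The GFLOPS numbers are ballpark reference-clock spec values, not measurements, and the relative scores are just the ratios stated in this post:

# Rough perf-per-peak-GFLOP comparison. Peak figures are assumed spec values.
peak_gflops = {
    "GTX 750 Ti (GM107)": 1306,   # ~640 cores * 2 * ~1.02 GHz (assumption)
    "GTX Titan (GK110)":  4500,   # assumption
    "R9 290X (Hawaii)":   5632,   # assumption
}
# Relative F@H throughput, normalised to the 750 Ti, taken from the post above.
relative_score = {
    "GTX 750 Ti (GM107)": 1.0,
    "GTX Titan (GK110)":  2.0,    # 750 Ti ~ half of Titan
    "R9 290X (Hawaii)":   4 / 3,  # 750 Ti ~ 3/4 of 290X
}
for name in peak_gflops:
    relative_peak = peak_gflops[name] / peak_gflops["GTX 750 Ti (GM107)"]
    eff = relative_score[name] / relative_peak
    print(f"{name:22s} relative perf per peak FLOP: {eff:.2f}")
# GM107 extracting noticeably more per peak FLOP is what the post calls
# improved computing efficiency.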
 
EVGA adds GeForce GTX 750 with 2GB and SC, Displayport connector

Bonus 2GB GDDR5 Memory on select EVGA GeForce GTX 750 cards.
NVIDIA G-SYNC Ready – the EVGA GeForce GTX 750 series have full support for NVIDIA G-SYNC Technology with included DisplayPort connector.
Copper Core Insert included on EVGA Superclocked range of 750 – lowers temperatures by 5 degrees Celsius

http://eu.evga.com/articles/00821/
 
Blender users are testing and benchmarking GM107; early results seem to be close to a GTX 570 at much lower power usage. Still more testing to be done.
 
So it seems results from LuxMark and SLG with higher-complexity models are valid performance indicators if done properly with the same driver revision.

edit:
As Warps are not tied to SIMD blocks but to schedulers (in Kepler as well), it doesn't change register allocation at all, especially as the register file size and the maximum number of Warps per SMX/SMM didn't change. The additional ALUs of Kepler could only lead to (most of the time) marginally faster execution, that's all.

Ah yes, right. Register contention occurs within the scheduler's domain. So it should even be a bit worse now, with the slightly longer lifetime of each Warp compared to Kepler ("The additional ALUs of Kepler could only lead to (most of the time) marginally faster execution, that's all."), yes?
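
For reference, a tiny sanity check of the register-budget argument, using the commonly cited per-SM figures (64K 32-bit registers, 64 resident warps, 32 threads per warp); these are assumptions here, not numbers from the thread:

# Per-thread register budget at full occupancy, Kepler SMX vs. Maxwell SMM.
registers_per_sm = 64 * 1024   # assumed: same 64K 32-bit registers on both
max_warps_per_sm = 64          # assumed: also unchanged
threads_per_warp = 32
regs_per_thread = registers_per_sm // (max_warps_per_sm * threads_per_warp)
print(f"registers per thread at full occupancy: {regs_per_thread}")   # -> 32
# Since neither the register file size nor the maximum warp count changed,
# the per-thread register budget (and hence allocation pressure) is the same;
# only the ALU count behind each scheduler differs.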
 
http://forums.laptopvideo2go.com/topic/30757-hp-mobile-driver-33233/

NVIDIA_DEV.1340.2280.103C = "NVIDIA GeForce 830M"
NVIDIA_DEV.1340.2281.103C = "NVIDIA GeForce 830M "
NVIDIA_DEV.1340.2282.103C = "NVIDIA GeForce 830M "
NVIDIA_DEV.1341.21A0.103C = "NVIDIA GeForce 840M"
NVIDIA_DEV.1341.21DB.103C = "NVIDIA GeForce 840M "
NVIDIA_DEV.1341.21DC.103C = "NVIDIA GeForce 840M "
NVIDIA_DEV.1341.2280.103C = "NVIDIA GeForce 840M "
NVIDIA_DEV.1341.2281.103C = "NVIDIA GeForce 840M "
NVIDIA_DEV.1341.2282.103C = "NVIDIA GeForce 840M "
NVIDIA_DEV.1341.228C.103C = "NVIDIA GeForce 840M "
NVIDIA_DEV.1341.228D.103C = "NVIDIA GeForce 840M "
NVIDIA_DEV.1341.228E.103C = "NVIDIA GeForce 840M "

I believe Device-ID 134x is GM108.
 
Maybe an OEM card with lower TDP.
A DDR3 card really would not deserve an 'X' at all :)
Well, it could be both. If it's DDR3 and they do it "right" at least (that is, on a 4-SMM card lower the core clock a bit, because it won't matter one bit anyway), TDP would trivially get down to something like 30-40W.
I fully agree it wouldn't deserve a GTX designation, but I'm sceptical that would stop them...
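
As a very rough illustration of the "lower the clock a bit" argument, here is a simple f*V^2 scaling sketch. The starting board power and the clock/voltage reductions are made-up but plausible values, chosen only to show the ballpark:

# Dynamic power scales roughly with f * V^2 (static/leakage power ignored).
base_tdp_w    = 55.0   # assumed GTX 750-class board power
clock_scale   = 0.85   # e.g. ~15% lower core clock
voltage_scale = 0.90   # modest voltage drop that the lower clock allows
scaled_tdp = base_tdp_w * clock_scale * voltage_scale ** 2
print(f"estimated board power: ~{scaled_tdp:.0f} W")   # ~38 W, in the 30-40 W range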
 
The problem with LuxMark is that, just like lots of other benchmarks, it only supports OpenCL, and NVIDIA's OpenCL support is very poor, so it's not a very good performance indicator for NVIDIA products and thus shouldn't be used as a benchmark for cross-platform comparison.

Folding@home, on the other hand, supports both OpenCL and CUDA routines; that's why I picked it as a performance indicator for cross-platform comparisons.
 
> The problem with LuxMark is that, just like lots of other benchmarks, it only supports OpenCL, and NVIDIA's OpenCL support is very poor, so it's not a very good performance indicator for NVIDIA products and thus shouldn't be used as a benchmark for cross-platform comparison.

> Folding@home, on the other hand, supports both OpenCL and CUDA routines; that's why I picked it as a performance indicator for cross-platform comparisons.

It all depends on whether you want to test "theoretical performance" or real-life performance. If NVIDIA's OpenCL support is bad, it's bad, and reviews should point it out rather than picking only software where the card performs well to test with.
 