NVIDIA Maxwell Speculation Thread

Very impressive indeed! If Maxwell is true to form and mid- to high-range cards scale similarly in performance/efficiency, this new architecture will definitely change much more than initially anticipated. :)
 
A Gainward GTX 750 is for sale at 99.9 euros, taxes included, in my country. That's not so bad.
Now, the problem is that every model I've looked up (750 or Ti) has three or four video outputs, but no DP 1.2.

Gigabyte has two DVI + two HDMI (one of the DVI connectors is dual-link DVI-I and the other single-link DVI-D, meaning you can't use a dual-link display and a VGA display at the same time). Asus has two DL-DVI plus VGA plus HDMI (but is the most expensive). Other cards have more conventional layouts.
 
The only vendor that is somewhat OK is EVGA, with one DisplayPort 1.2 output. The rest of them are terrible, still putting a VGA connector on a card in 2014. :rolleyes:

http://www.evga.com/articles/00821/#3751

http://www.evga.com/products/images/gallery/02G-P4-3751-KR_XL_4.jpg

No forward-looking vendors at all offering 3x DisplayPort 1.2 and 1x HDMI. NVIDIA really needs to lay down the law and move away from DVI-I/D.

http://www.techpowerup.com/reviews/ASUS/GTX_750_Ti_OC/30.html

I also found memory overclocking on our sample to be quite limited, worse than other GTX 750 Ti cards using the same memory chips. So, even though the 6-pin power connector suggests superior overclocking capability, it doesn't work that way. I've tested other GTX 750 Ti cards that are better suited if you plan to overclock. Don't get me wrong, overclocking is not terrible as it still provides a good 10% real life performance increase, but I was expecting more.

Yeah, don't even buy Asus cards, terrible.
 
As long as they use the same timings, no. I don't think there's really any difference between Kepler, Maxwell, or newer AMD cards there from a hardware point of view; the problem is always the same (you can't reclock the memory if the vblank periods aren't synchronized - typically for DVI monitors this means they need to be driven from the same clock source, and NVIDIA definitely did that earlier; it's also possible the card BIOS needs to play along, not just the driver, I'm not really sure there).

Sorry for being somewhat off topic, but does anyone know exactly what's needed to have the monitors in sync so the 290 runs the low clocks? It appears to be working for some people with different monitors, but not for my two (different) 60 Hz monitors... are some combinations of outputs (tried DVI+HDMI and DVI+DVI for now) better than others, etc.?
 
The EDID timings (from the panels) need to match. Many timings are fairly common, so you can often find different panels running the same timings at the same resolution.
 
Sorry for being somewhat off topic, but does anyone know exactly what's needed to have the monitors in sync so the 290 runs the low clocks? It appears to be working for some people with different monitors, but not for my two (different) 60 Hz monitors... are some combinations of outputs (tried DVI+HDMI and DVI+DVI for now) better than others, etc.?
They have to be perfectly in sync. Same resolution, same timings, same polarity. It's similar to NVIDIA's requirements for their Surround modes.
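
If you want to check what your panels actually report, the relevant numbers live in the first 18-byte detailed timing descriptor of each monitor's EDID. A rough sketch of how you might compare them (Python; assumes Linux exposes the raw EDID under /sys/class/drm, that the preferred mode sits in the first descriptor, and the connector names below are hypothetical - adjust for your setup):

Code:
# Pull pixel clock, active/blanking sizes and sync polarity out of the first
# detailed timing descriptor (bytes 54..71 of the base EDID block) so two
# panels can be compared. Polarity bits assume digital separate sync.
def first_dtd(edid_path):
    with open(edid_path, "rb") as f:
        d = f.read(128)[54:72]                 # first 18-byte descriptor
    pclk_khz = (d[0] | d[1] << 8) * 10         # pixel clock, stored in 10 kHz units
    h_active = d[2] | (d[4] & 0xF0) << 4
    h_blank  = d[3] | (d[4] & 0x0F) << 8
    v_active = d[5] | (d[7] & 0xF0) << 4
    v_blank  = d[6] | (d[7] & 0x0F) << 8
    hsync_pos = bool(d[17] & 0x02)             # horizontal sync polarity
    vsync_pos = bool(d[17] & 0x04)             # vertical sync polarity
    return (pclk_khz, h_active, h_blank, v_active, v_blank, hsync_pos, vsync_pos)

a = first_dtd("/sys/class/drm/card0-DVI-D-1/edid")    # hypothetical connector names
b = first_dtd("/sys/class/drm/card0-HDMI-A-1/edid")
print(a)
print(b)
print("timings match" if a == b else "timings differ")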
 
OK, thanks... so no chance when they are different resolutions...

But the NVIDIA cards still seem able to drop to the lowered memory clock when running 2 monitors with different resolutions and timings (but not with 3) - http://www.techpowerup.com/reviews/ASUS/GTX_750_Ti_OC/23.html http://ht4u.net/reviews/2014/nvidia_geforce_gtx_750_ti_im_test/index17.php . Just wondering how they retrain the GDDR5 then.
Oh, you're right, I missed that - so forget what I said about there being no difference.
My _guess_ would be that they simply have a large enough line buffer for scanout (for one monitor) so they don't need to wait for the vblank interval. Though I think you'd need something in the 100 kB range for that, which sounds a bit expensive (of course this would be resolution and refresh rate dependent). Maybe they could exploit the L2 cache for this instead. I could be very wrong though :).
Thinking about it, I actually like the large line buffer idea, because it might also help performance a tiny bit: you can lower the priority of memory requests coming from the display controller as long as your buffer is "full enough". There are actually AMD APUs out there with barely adequate line buffers (on those you could run into issues with two high-res displays, depending on some other factors), because even though memory bandwidth was still sufficient, it was difficult to fill the line buffer in a timely manner due to other outstanding memory requests. And it's not hard to imagine this costs some performance if you practically always have to give highest priority to memory requests originating from the display controller, even when the line buffer is still full, rather than being able to wait a bit until there are no other outstanding requests (this is all configurable; IIRC AMD called this "display watermarks").
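
Quick back-of-the-envelope on that buffer size for a single 1080p60 head, assuming standard CEA-861 timing (148.5 MHz pixel clock, 2200x1125 total including blanking) and 4 bytes fetched per pixel - purely illustrative numbers:

Code:
pclk      = 148.5e6           # pixel clock in Hz (1080p60, CEA-861 timing assumed)
bpp       = 4                 # bytes fetched per active pixel (ARGB8888 assumed)
line_time = 2200 / pclk       # ~14.8 us per scanline (2200 total pixels per line)
vblank    = 45 * line_time    # 45 blanking lines -> ~0.67 ms of vblank per frame

buffer_bytes = 100 * 1024     # the "~100 kB" ballpark guessed above
drain_time   = buffer_bytes / (pclk * bpp)   # worst case: drains at the active pixel rate

print(f"vblank per frame:          {vblank * 1e6:.0f} us")
print(f"100 kB buffer lasts about: {drain_time * 1e6:.0f} us of active scanout")

So a buffer in that range buys the memory controller a reclock window of a couple hundred microseconds per fill at 1080p60; higher resolutions or refresh rates scale the requirement up accordingly.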
 
Apparently the 750 Ti is a great card for bitcoin mining thanks to its performance-per-watt ratio. I guess the price of this card will skyrocket in the US.

[image: mining performance-per-watt chart]


Thankfully we don't have this problem in Europe (at least in Spain), where prices of AMD cards have remained stable.

Kind of disappointed we probably won't get a mid-to-high range Maxwell card until next year. :cry:
 
Comparing the Tom's Hardware latency graph with Haswell numbers here, it looks like NVIDIA's 24 KB L1 latency is about 3x Haswell's 6 MB L3 latency, or comparable to Intel's 128 MB off-chip L4 latency (for in-page random loads, which apparently means an access pattern that avoids TLB misses). And that's comparing just cycle counts; Haswell's cycle time is less than half that of the GM107.

I'm wondering if Sandra might be launching a bunch of warps/wavefronts that all do memory access and end up getting scheduled in round robin fashion, so that what Sandra really measures is something more like the size of the scheduler's queue of warps/wavefronts rather than actual cache latency.
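
Just to put the cycle-count caveat into absolute time, a trivial conversion (clock figures are ballpark assumptions: Haswell around 3.5 GHz, GM107 boost around 1.1 GHz; the cycle counts below are hypothetical, not Sandra's data):

Code:
haswell_hz = 3.5e9    # assumed Haswell core clock
gm107_hz   = 1.1e9    # assumed GM107 boost clock

def ns(cycles, hz):
    return cycles / hz * 1e9   # cycles -> nanoseconds at a given clock

for cycles in (30, 100, 300):  # hypothetical latencies, for illustration only
    print(f"{cycles:4d} cycles = {ns(cycles, haswell_hz):6.1f} ns on Haswell, "
          f"{ns(cycles, gm107_hz):6.1f} ns on GM107")

The same cycle count is roughly 3x longer in wall-clock time on GM107, which is why comparing raw cycle counts understates the absolute latency gap.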
Irrespective of whether or not the Sandra numbers are correct, it is to be expected that CPUs have much lower latency than GPUs: latency is absolutely critical for a CPU to prevent stalls. Not so for a GPU, which usually has plenty of latency-hiding work. It'd be a waste of resources to optimize a GPU for a low-latency cache that isn't strictly needed.
 
Apparently the 750 Ti is a great card for bitcoin mining thanks to its performance-per-watt ratio. I guess the price of this card will skyrocket in the US.

[image: mining performance-per-watt chart]


Thankfully we don't have this problem in Europe (at least in Spain), where prices of AMD cards have remained stable.

Kind of disappointed we probably won't get a mid-to-high range Maxwell card until next year. :cry:

If these results were for Litecoin or Dogecoin, yes, but Bitcoin is entirely dominated by ASICs. I think they're even taking over Litecoin, or about to.
 
He's comparing apples to oranges. The latencies shown in the Tom's Hardware graph are simply the overall latency for random memory accesses; they don't say whether the access is served from L1, L2, or global memory.

While there is no denying that the global memory latency and L2 latency of a GPU are definitely higher, I would say the L1/shared memory latencies should be comparable between the two architectures, not to mention the advantage of having programmable memory at L1 latency.
 
Apparently the 750 Ti is a great card for bitcoin mining thanks to its performance-per-watt ratio. I guess the price of this card will skyrocket in the US.

[image: mining performance-per-watt chart]

1) No GPU is great for bitcoin mining. They all suck; dedicated ASICs have made GPUs way too slow, and the cost of the electricity bill is many times higher than the value of the mined bitcoins.

For litecoins the situation might be different, though I don't know for sure.

2) The graph/benchmark sucks.
They have just benchmarked the performance and divided it by the TDP. That gives very unreliable results.
You can immediately see this when their overclocked results show better numbers, but overclocking always decreases power efficiency once you have to increase the voltage.

In order to get real performance/power results you have to measure the actual power usage during the task you are executing.

The TDP number is only meant to be used for selecting a power supply and cooling solution; it gives the maximum, not the "normal", power consumption.
 
It's true that the Tom's Hardware latency graph doesn't explicitly say what HW structure the accesses are going to (how would Sandra know?). But the working set size is indicated, and there's a distinct jump in latency at 12 KB and 2 MB, which matches the L1 and L2 sizes that have been talked about in architecture previews.

I do realize that latency is not as important for a GPU's L1 cache as it is for a CPU's. I was comparing the two because the (claimed) latency differences struck me as ridiculous in spite of that. Point taken regarding apples versus oranges though.
 
Oh, you're right, I missed that - so forget what I said about there being no difference.
My _guess_ would be that they simply have a large enough line buffer for scanout (for one monitor) so they don't need to wait for the vblank interval. Though I think you'd need something in the 100 kB range for that, which sounds a bit expensive (of course this would be resolution and refresh rate dependent). Maybe they could exploit the L2 cache for this instead. I could be very wrong though :).

Sounds about right. For 1080p, draining 100 kB should take something on the order of the vblank interval, and since all the cards behave identically (possible with 2 displays, but not 3) I guess it's a fixed buffer in the dedicated scan-out part and not the L2(s) - otherwise the big guns should have plenty of spare cache for 3 displays (and more reason to do it). Isn't 100 kB at a (comparatively) very low speed quite small? And as you mention, you maybe also gain some performance from the reduced scan-out priority.

2) The graph/benchmark sucks.
They have just benchmarked the performance and divided it by the TDP. That gives very unreliable results.
You can immediately see this when their overclocked results show better numbers, but overclocking always decreases power efficiency once you have to increase the voltage.

Yup, plus the fact that you have a limited number of PCIe slots in a system, so needing 3 times as many cards for the same throughput isn't practical.
Actually got the power meter on the 290 machine now, and it's drawing 360 W from the wall while mining at 860 kH/s - that's probably 250 W max for the card. GPU-Z hovers around 210 W.
But at least they got the cgminer numbers more right than Tom's this time ;)
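
Running those measured numbers the way the chart should have been computed (all figures from this post: ~860 kH/s, ~360 W at the wall, ~250 W estimated for the card alone):

Code:
hash_khs   = 860.0   # measured scrypt rate
wall_w     = 360.0   # measured at the wall for the whole system
card_w_est = 250.0   # rough estimate for the card alone

print(f"whole system:     {hash_khs / wall_w:.2f} kH/s per W")      # ~2.4
print(f"card only (est.): {hash_khs / card_w_est:.2f} kH/s per W")  # ~3.4

Dividing by a TDP figure instead of measured draw can easily shift numbers like these by tens of percent, which is exactly the objection to the chart above.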
 
Apparently the 750 Ti is a great card for bitcoin mining thanks to its performance-per-watt ratio. I guess the price of this card will skyrocket in the US.

[image: mining performance-per-watt chart]


Thankfully we don't have this problem in Europe (at least in Spain), where prices of AMD cards have remained stable.

Kind of disappointed we probably won't get a mid-to-high range Maxwell card until next year. :cry:

It's not that simple; ExtremeTech's results, for example, disagree ( http://www.extremetech.com/gaming/1...per-efficient-quiet-a-serious-threat-to-amd/3 )
[image: ExtremeTech Litecoin mining efficiency chart]


The big problem with mining benchmarks, including per-watt ones, is that there are huge differences even between "identical" cards - 10%+ is easily there - and apparently BIOS versions make huge differences too (20-30%+).
 
I mine scrypt coins at 890 kH/s per card on 2x R9 290X with a system power consumption of 695 W. Mind you, this is my normal gaming PC, so it has an overclocked i5, 2x SSD, 2x HDD, a Blu-ray RW and lots of fans, card readers, and USB stuff. The cards on their own are probably around 220-240 W each, which gives an efficiency of around 3.7 kH/s/W.

Cards with Hynix memory can mine at speeds of up to 990 kH/s while power consumption stays the same as on Elpida cards.

One more thing: has anyone measured NVIDIA cards using cgminer and OpenCL? Has the newest driver increased the mining rate similar to how LuxRender sped up?
 