NVIDIA Maxwell Speculation Thread

What's wrong with 64 ROPs?
It's as much as Haiti. The next generation of GPUs will probably be branded as "4K GPUs".
The marketing teams will say these are the graphics cards we'll want for playing games on 4K TVs or multi-monitor setups, while keeping up with the "per-pixel" quality that the new consoles pump out at 900p/1080p.
It makes sense to start scaling the fillrate capabilities of the new GPUs.
 
There's nothing inherently wrong with 64 ROPs. ROPs, though, eat bandwidth for breakfast, lunch and dinner, so it makes sense to have a fixed number per MC (disregarding cases such as some NVIDIA GPUs that have more ROPs than the rasterizer or SMX can keep busy most of the time).
It should be relatively trivial to double the number of ROPs per MC without really changing the overall structure, so IMHO it's something that would be quite possible for GM2xx. And if they've got 4 GPCs that can rasterize 16 pixels each (like GM107 but unlike Kepler), plus 16 SMMs (4 pixels exported per clock each), it could potentially be quite useful. So, given the bandwidth efficiency improvements of Maxwell, I wouldn't rule this out completely (it would also help fp32 blend without changing the ROPs themselves, because for this the Kepler/GM107 ROPs are so slow that performance is way below bandwidth limits in any case).
btw what is Haiti ;-).
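The balance argument above can be sanity-checked with a few lines of Python. This is only a back-of-the-envelope sketch: the per-unit rates (16 pixels per clock per GPC, 4 pixels exported per clock per SMM) come from the post, while the function name and the assumption that the slowest stage caps pixel throughput are illustrative.

```python
def peak_pixels_per_clock(gpcs, px_per_gpc, smms, px_per_smm, rops):
    """Return per-stage pixel rates and the rate of the slowest stage.

    Assumption: peak throughput is capped by whichever stage
    (rasterizer, SM export, ROP) handles the fewest pixels per clock.
    """
    stages = {
        "raster": gpcs * px_per_gpc,
        "sm_export": smms * px_per_smm,
        "rop": rops,
    }
    return stages, min(stages.values())


# Speculated GM2xx layout from the post: 4 GPCs, 16 SMMs, 64 ROPs.
stages, limit = peak_pixels_per_clock(4, 16, 16, 4, 64)
print(stages, limit)

# Same front end with only 32 ROPs: the ROPs become the limit.
_, limit32 = peak_pixels_per_clock(4, 16, 16, 4, 32)
print(limit32)
```

Under these assumptions 64 ROPs is exactly what the rasterizers and SMMs could feed, while 32 ROPs would leave half of that front-end rate unused.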
 
What's wrong with 64 ROPs?
It's as much as Haiti. The next generation of GPUs will probably be branded as "4K GPUs".
The marketing teams will say these are the graphics cards we'll want for playing games on 4K TVs or multi-monitor setups, while keeping up with the "per-pixel" quality that the new consoles pump out at 900p/1080p.
It makes sense to start scaling the fillrate capabilities of the new GPUs.

Even if they have 16 ROPs they will say that... Hawaii has a 512-bit bus, which is basically what is needed behind 64 ROPs, but above all Hawaii can deal with an extremely large memory area (16GB with really good efficiency). I don't know, maybe it could benefit the Maxwell architecture with its big cache and all. But seriously, I'm not sure of it.
 
Obviously conflicts with the last rumour in unit counts (this one being 32 ROPs, 2048 cores and 128 TMUs).

I find these specs a little more believable tbh, since this is basically just a GTX 680 built on the Maxwell architecture (granting the increase in cores per SMX), but if it's 10% faster than the 780 Ti with those specs then WOW! That would be a huge efficiency boost, assuming it's running at something around 1GHz. Going from the Firestrike scores, it would make it around 70% faster than the 680 with the same number of SMXs, with a lower TDP to boot!

I really hope this is true!

EDIT: If we assume clocks are more likely to be in 770 territory than 680, then the performance increase drops to around 55% over GK104, but using only 74% of the TDP. Crazy!
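For what it's worth, the efficiency claim in the EDIT is simple to put a number on. A minimal sketch, using only the two ratios quoted above (55% faster, 74% of the TDP) and treating TDP as a stand-in for actual power draw, which is an assumption:

```python
def relative_perf_per_watt(perf_ratio, power_ratio):
    """Perf/W of the new part relative to the old one.

    perf_ratio:  new performance / old performance
    power_ratio: new TDP / old TDP (TDP used as a proxy for power)
    """
    return perf_ratio / power_ratio


# Rumoured part vs GK104: 1.55x the performance at 0.74x the TDP.
gain = relative_perf_per_watt(1.55, 0.74)
print(f"{gain:.2f}x perf/W vs GK104")
```

That works out to roughly 2.1x the performance per watt, if the rumoured figures hold.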
 
And if they've got 4 GPCs that can rasterize 16 pixels each (like GM107 but unlike Kepler), plus 16 SMMs (4 pixels exported per clock each), it could potentially be quite useful.


Good call. Didn't realize GM107 was pumping out 16 pixels per clock from its one rasterizer.

NVIDIA's fp16 writes and blends are half-speed, so doubling the ROPs is one way to close that gap. Kepler shows some benefit from L2 on blends but is pretty much ROP bound on all cards. Hawaii and Bonaire do much better given full-speed fp16, and Tonga is in a whole different class.
 
I hope you're wrong :D

I could very well be; however, try to convince me why you'd need 64 ROPs with relatively so little bandwidth, and while you're at it, can I then have 96 if not 128 ROPs on GM200?
 
I could very well be; however, try to convince me why you'd need 64 ROPs with relatively so little bandwidth, and while you're at it, can I then have 96 if not 128 ROPs on GM200?

I thought you were referring to the entire specs (CC/TMU/ROPs), not only the ROPs.
 
I could very well be; however, try to convince me why you'd need 64 ROPs with relatively so little bandwidth, and while you're at it, can I then have 96 if not 128 ROPs on GM200?


If the tile/buffer currently being filled fits fully in L2, effective bandwidth for a blend is 2x off-chip bandwidth. Compression takes that even higher - Tonga's effective bandwidth on blends is over 200% of theoretical max.
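One way to read the 2x figure is the toy model below. It is a sketch under stated assumptions, not a description of how any particular GPU works: a blend touches each destination pixel twice (one read, one write), destination reads are served from L2 when the tile is resident, every pixel is still written off-chip once, and compression shrinks off-chip traffic by a fixed ratio. The function name, parameters and the 100 GB/s example value are all hypothetical.

```python
def effective_blend_bandwidth(offchip_gbs, l2_resident=True, compression=1.0):
    """Blend traffic seen by the ROPs (GB/s) for a given off-chip bandwidth.

    Model (assumption): per pixel the ROP moves 2 units of data
    (read dest + write dest). Off-chip sees the write always, and the
    read only when the tile is not resident in L2. `compression` is the
    ratio by which off-chip traffic shrinks (1.0 = none).
    """
    rop_units_per_pixel = 2.0
    offchip_reads = 0.0 if l2_resident else 1.0
    offchip_units_per_pixel = (offchip_reads + 1.0) / compression
    return offchip_gbs * rop_units_per_pixel / offchip_units_per_pixel


print(effective_blend_bandwidth(100.0))                     # tile fits in L2
print(effective_blend_bandwidth(100.0, l2_resident=False))  # tile spills
print(effective_blend_bandwidth(100.0, compression=1.25))   # L2 + compression
```

In this model an L2-resident tile doubles the effective blend bandwidth, and any compression on top pushes it past 200% of the off-chip figure.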
 
Tilers have a limit to how much geometry they can buffer, so while the ideal implementation will bucket an entire frame, this doesn't happen with a lot of geometry. Thus HSR might not be perfect.

You also don't need to bin after tessellation if you want to minimize geometry storage and can tolerate less accurate HSR. Another cost to binning before tessellation is patches will often cover multiple bins and you might need to re-tessellate. If you don't have an efficient way to tessellate part of a patch there will be a lot of duplicated work.

Edit: I forgot you mentioned displacement. That would make HSR prior to tessellation difficult. You could however sort during binning so z-buffering is likely to throw out a lot of the work.

HSR obviously happens after tessellation.
 
16C @ ~1.2GHz, 2MiB L2, 256-bit 7GHz graphics card in SiSoft DB:
http://www.sisoftware.eu/rank2011d/...e6dbeacca499ac8af2cffed8bdd8e5d5f380bd85&l=en

There were some early 6.6GHz memory test runs in May, when GM204 bring-up was shipped at Zauba.

This card has some massive Cryptography (High Security) BW and 1:32 DP performance.
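To tie the listed memory specs back to the ROP discussion, here is a rough sketch of the arithmetic. The 256-bit bus, 7GHz effective memory rate and ~1.2GHz core clock come from the SiSoft entry above; the 64-ROP count is only the rumoured figure from this thread, and 4 bytes per pixel (32-bit colour) and uncompressed traffic are assumptions.

```python
def memory_bandwidth_gbs(bus_bits, data_rate_gbps):
    """Peak off-chip bandwidth in GB/s: bus width in bytes x per-pin rate."""
    return bus_bits / 8 * data_rate_gbps


def rop_demand_gbs(rops, core_ghz, bytes_per_pixel=4, blend=False):
    """Peak uncompressed ROP traffic in GB/s.

    Assumption: one pixel per ROP per clock; blending doubles the
    traffic because the destination is read as well as written.
    """
    passes = 2 if blend else 1
    return rops * core_ghz * bytes_per_pixel * passes


bw = memory_bandwidth_gbs(256, 7.0)   # 256-bit bus at 7 Gbps per pin
demand = rop_demand_gbs(64, 1.2)      # hypothetical 64 ROPs at ~1.2 GHz
print(f"memory: {bw:.1f} GB/s, ROP writes: {demand:.1f} GB/s")
```

With these numbers the bus delivers 224 GB/s while 64 ROPs could in theory ask for about 307 GB/s on plain 32-bit writes, so the extra ROPs only pay off when L2 hits or compression cover the difference.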


Is it possible to read the L2 cache of NV cards through OpenCL/CUDA/NVAPI? There is even a 3GB / 1.5MiB version - maybe some BW scaling testing.
2MiB L2 would be a bit below average expectation...
 