NVIDIA Maxwell Speculation Thread

Deleted member 13524 · Sep 6, 2014

What's wrong with 64 ROPs?
It's as much as Haiti. The next generation of GPUs will probably be branded as "4K GPUs".
The marketing teams will say these are the graphics cards we'll want for playing games on 4K TVs or multi-monitor setups, while keeping up with the "per-pixel" quality that the new consoles pump out at 900p/1080p.
It makes sense to start scaling the fillrate capabilities of the new GPUs.

mczak · Sep 6, 2014

There's nothing inherently wrong with 64 ROPs. ROPs, though, eat bandwidth for breakfast, lunch and dinner, so it makes quite a lot of sense to have some fixed number per MC (disregarding issues such as with some nvidia GPUs which have more ROPs than the rasterizer or SMX can potentially keep busy in most cases).
It should be relatively trivial to double the ROP number per MC without really changing the overall structure, so imho it would theoretically be quite possible for GM2xx. And, if they've got 4 GPC which can rasterize 16 pixels each (like GM107 but unlike Kepler), plus 16 SMM (4 pixels export per clock each) it could potentially be quite useful. So, given the bandwidth efficiency improvements of Maxwell, I wouldn't rule this out completely (it would also help fp32 blend without changing the ROPs themselves, because for this the Kepler/GM107 ROPs are so slow that performance is way below bandwidth limits in any case).
btw what is Haiti ;-).
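
A rough sketch of the per-clock pixel budget implied by these numbers, in Python. The unit counts are the speculated ones from this post, 32 and 64 ROPs are the two rumoured totals, and one pixel per ROP per clock is an assumption for plain 32-bit colour writes. `build_stages` and `bottleneck` are names made up for illustration.

```python
# Peak pixels per clock at each stage of a hypothetical GM204.
# All unit counts are speculation from the thread, not confirmed specs.

def build_stages(rop_count):
    """Return peak pixels/clock for raster, shader export and ROP stages."""
    return {
        "raster": 4 * 16,       # 4 GPCs x 16 pixels rasterized per clock
        "smm_export": 16 * 4,   # 16 SMMs x 4 pixels exported per clock
        "rops": rop_count * 1,  # assumption: 1 pixel per ROP per clock
    }

def bottleneck(stages):
    """Name of the slowest stage (the first one listed if several tie)."""
    return min(stages, key=stages.get)

for rops in (32, 64):
    stages = build_stages(rops)
    print(rops, "ROPs:", stages, "-> limited by", bottleneck(stages))
```

Under these assumptions 32 ROPs would cap the chip at half of what the rasterizers and SMMs can deliver, while 64 ROPs would balance all three stages at 64 pixels per clock. Whether the memory system can feed that is a separate question.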

lanek · Sep 6, 2014

ToTTenTranz said:
What's wrong with 64 ROPs?
It's as much as Haiti. The next generation of GPUs will probably be branded as "4K GPUs".
The marketing teams will say these are the graphics cards we'll want for playing games on 4K TVs or multi-monitor setups, while keeping up with the "per-pixel" quality that the new consoles pump out at 900p/1080p.
It makes sense to start scaling the fillrate capabilities of the new GPUs.

Even if they have 16 ROPs they will say that... Hawaii has a 512-bit bus, which is basically what is needed behind 64 ROPs, but before anything else, Hawaii can deal with an extremely large memory area (16GB with really good efficiency). I don't know, maybe it could benefit the Maxwell architecture with its big cache and all. But seriously, I'm not sure of it.

xDxD · Sep 6, 2014

Just 170W TDP for +10% over the 780 Ti (on the same 28nm) in 3DMark seems very good

wccftech.com/geforce-gtx-980-alleged-benchmark-tdp-170w/

pjbliverpool · Sep 6, 2014

Obviously conflicts with the last rumour in unit counts (this one being 32 ROPs, 2048 cores and 128 TMUs).

I find these specs a little more believable tbh since this is basically just a GTX 680 built on the Maxwell architecture (granting the increase in cores per SMX) but if it's 10% faster than the 780 Ti with those specs then WOW! That would be a huge efficiency boost assuming it's running at something around 1GHz. Going by the Fire Strike scores it would make it around 70% faster than the 680 with the same number of SMXs, with a lower TDP to boot!

I really hope this is true!

EDIT: If we assume clocks are more likely to be in 770 territory than 680 then the performance increase drops to around 55% over GK104 but using only 74% of the TDP. Crazy!
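
The arithmetic behind that last estimate can be sketched in a few lines of Python. The 170W figure is the rumoured TDP and 1.55 is the estimated relative performance from above. The 230W baseline is an assumption (the GTX 770 reference TDP, which is what makes the 74% figure work out), and TDP is used here only as a stand-in for real power draw.

```python
# Back-of-envelope perf-per-TDP comparison; all inputs are rumours or assumptions.

def perf_per_watt_gain(relative_perf, relative_power):
    """How many times more performance per watt the new part delivers."""
    return relative_perf / relative_power

rumoured_tdp_w = 170.0   # from the leaked benchmark article
baseline_tdp_w = 230.0   # assumption: GTX 770 reference TDP
relative_perf = 1.55     # estimated ~55% faster than GK104 at 770-like clocks

relative_power = rumoured_tdp_w / baseline_tdp_w   # roughly 0.74
gain = perf_per_watt_gain(relative_perf, relative_power)
print(f"relative TDP: {relative_power:.2f}, perf per TDP watt: {gain:.2f}x")
```

With these inputs the result lands at roughly 2.1x performance per TDP watt on the same 28nm process, which is why the leak looks so surprising if it holds up.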

boxleitnerb · Sep 6, 2014

TDP is not power consumption.

OlegSH · Sep 6, 2014

boxleitnerb said:
TDP is not power consumption.

It's 99.9% of it

boxleitnerb · Sep 6, 2014

OlegSH said:
It's 99.9% of it

No, because TDP is a static value. Real-world power is not; it is load and application dependent.

trinibwoy · Sep 6, 2014

mczak said:
And, if they've got 4 GPC which can rasterize 16 pixels each (like GM107 but unlike Kepler), plus 16 SMM (4 pixels export per clock each) it could potentially be quite useful.

Good call. Didn't realize GM107 was pumping out 16 pixels per clock from its one rasterizer.

nVidia's fp16 writes and blends are half-speed so doubling the ROPs is one way to close that gap. Kepler shows some benefit from L2 on blends but is pretty much ROP bound on all cards. Hawaii and Bonaire do much better given full speed fp16 and Tonga is in a whole different class.

Ailuros · Sep 6, 2014

xDxD said:
I hope you're wrong

I could very well be, however try to convince me why you'd need 64 ROPs with relatively so little bandwidth, and while you're at it can I then have 96 if not 128 ROPs on GM200?

xDxD · Sep 6, 2014

Ailuros said:
I could very well be, however try to convince me why you'd need 64 ROPs with relatively so little bandwidth, and while you're at it can I then have 96 if not 128 ROPs on GM200?

I thought you were referring to the entire specs (cc/tmu/rops), not only the ROPs

Ailuros · Sep 6, 2014

xDxD said:
I thought you were referring to the entire specs (cc/tmu/rops), not only the ROPs

With the endless guessing for =/>15 SMMs for the GM204 someone eventually will hit the jackpot LOL

trinibwoy · Sep 6, 2014

Ailuros said:
I could very well be, however try to convince me why you'd need 64 ROPs with relatively so little bandwidth, and while you're at it can I then have 96 if not 128 ROPs on GM200?

If the tile/buffer currently being filled fits fully in L2, effective bandwidth for a blend is 2x off-chip bandwidth. Compression takes that even higher - Tonga's effective bandwidth on blends is over 200% of theoretical max.
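
A toy model of that claim, as a Python sketch. It assumes a blend costs one destination read plus one write per pixel, that an L2-resident destination removes the off-chip read, and that compression simply divides off-chip bytes by a ratio. The function name, the 224 GB/s figure and the 1.5:1 ratio are illustrative, not measured values.

```python
# Simplified blend fill-rate model; real hardware behaviour is more complex.

def blend_pixels_per_second(offchip_bytes_per_s, bytes_per_pixel,
                            dest_in_l2=False, compression_ratio=1.0):
    """Blended pixels per second sustainable by off-chip bandwidth alone."""
    # Off-chip accesses per blended pixel: read + write normally,
    # write only when the destination read is served from L2.
    accesses = 1 if dest_in_l2 else 2
    offchip_bytes_per_pixel = accesses * bytes_per_pixel / compression_ratio
    return offchip_bytes_per_s / offchip_bytes_per_pixel

bandwidth = 224e9  # hypothetical off-chip bandwidth in bytes/s
baseline = blend_pixels_per_second(bandwidth, 4)
in_l2 = blend_pixels_per_second(bandwidth, 4, dest_in_l2=True)
in_l2_compressed = blend_pixels_per_second(bandwidth, 4, dest_in_l2=True,
                                           compression_ratio=1.5)
print(in_l2 / baseline, in_l2_compressed / baseline)
```

In this model the L2-resident case is exactly twice the baseline, and any compression ratio above 1 pushes the effective figure past 2x, which is the shape of the Tonga result described above.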

rpg.314 · Sep 7, 2014

trinibwoy said:
Doesn't all geometry amplification and displacement have to be completed before the final HSR pass?

Yes, but you don't have to stream out the amplified geometry. You can stream out just the patches.

rpg.314 · Sep 7, 2014

3dcgi said:
Tilers have a limit to how much geometry they can buffer so while the ideal implementation will bucket an entire frame this doesn't happen with a lot of geometry. Thus HSR might not be perfect.

You also don't need to bin after tessellation if you want to minimize geometry storage and can tolerate less accurate HSR. Another cost to binning before tessellation is patches will often cover multiple bins and you might need to re-tessellate. If you don't have an efficient way to tessellate part of a patch there will be a lot of duplicated work.

Edit: I forgot you mentioned displacement. That would make HSR prior to tessellation difficult. You could however sort during binning so z-buffering is likely to throw out a lot of the work.

HSR obviously happens after tessellation.

tviceman · Sep 7, 2014

Ailuros said:
With the endless guessing for =/>15 SMMs for the GM204 someone eventually will hit the jackpot LOL

I'm sticking with 20 SMMs. GK104 was 4x the cores of GK107, and GF114 was 4x GF107...
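
The extrapolation as a quick Python sketch. The Kepler and GM107 core counts are the shipping configurations; applying the same 4x step to Maxwell is the guess being made here, not a known spec.

```python
# Scaling the small die by the same factor Kepler used.
GK107_CORES = 384
GK104_CORES = 1536
GM107_SMMS = 5          # GM107 ships with 5 SMMs
CORES_PER_SMM = 128     # Maxwell SMM width

kepler_scale = GK104_CORES // GK107_CORES        # 4
predicted_smms = GM107_SMMS * kepler_scale       # guess for GM204
predicted_cores = predicted_smms * CORES_PER_SMM

print(kepler_scale, predicted_smms, predicted_cores)
```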

AnarchX · Sep 7, 2014

16C @ ~1.2GHz, 2MiB L2, 256-bit 7GHz graphics card in SiSoft DB:
http://www.sisoftware.eu/rank2011d/...e6dbeacca499ac8af2cffed8bdd8e5d5f380bd85&l=en

There were some early 6.6GHz memory test runs in May, when GM204 bring-up was shipped at Zauba.

This card has some massive Cryptography (High Security) BW and 1:32 DP performance.

Is it possible to read the L2 cache of NV cards through OpenCL/CUDA/NVAPI? There is even a 3GB / 1.5MiB version - maybe some BW scaling testing.
2MiB L2 would be a bit below average expectation...
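
For reference, the peak memory bandwidth those entries imply, as a small Python sketch. It treats the quoted "7GHz" and "6.6GHz" as effective per-pin data rates in Gbit/s, which is the usual convention for GDDR5 listings but is an assumption about how the database reports them.

```python
# Peak theoretical bandwidth = bus width in bytes x effective per-pin data rate.

def memory_bandwidth_gb_s(bus_width_bits, effective_rate_gbps):
    """Peak memory bandwidth in GB/s for a given bus width and data rate."""
    return bus_width_bits / 8 * effective_rate_gbps

final_run = memory_bandwidth_gb_s(256, 7.0)   # the SiSoft DB entry
early_run = memory_bandwidth_gb_s(256, 6.6)   # the earlier test runs

print(f"{final_run:.1f} GB/s vs {early_run:.1f} GB/s")
```

That works out to 224 GB/s at 7GHz and about 211 GB/s at 6.6GHz, both well short of a 384-bit GK110 board, so any gain over the 780 Ti would have to come from cache and bandwidth efficiency.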

Ailuros · Sep 7, 2014

As I asked elsewhere: is the application even reading out data correctly?

Megadrive1988 · Sep 7, 2014

http://videocardz.com/52166/nvidia-geforce-gtx-980-gtx-970-gtx-980m-gtx-970m-3dmark-performance

Grall · Sep 7, 2014

1:32 DP is very disappointing for distributed computing purposes.
