NVIDIA Maxwell Speculation Thread

I'm not sure I believe hardware.fr's diagrams on that point. I don't see any justification for their claims in the article, and they've also got the texture cache/L1 size at 24KB per SMM (half the amount per SMX), despite the fact that it is now apparently servicing memory reads/writes from the shader cores. Hopefully they follow up with details on how they came to their conclusions.

What's wrong with which of their diagrams? They look fine to me, and they're certainly more accurate than most of the other stuff fabricated or copied out there.
 
In addition to just posting a link and telling us to hold our horses, can you give a hint as to what we should be looking at?

The only thing that might curb our enthusiasm in this article is the idea that NVIDIA saved power by removing all the inter-GPC interconnect logic (just like in Tegra K1) which won't be possible in bigger Maxwell chips.

But I think most of the savings come from the redesigned SMs and extra L2 cache, so I'm not too worried about scaling.
 
What's wrong with which of their diagrams? They look fine to me, and they're certainly more accurate than most of the other stuff fabricated or copied out there.
I especially like the "new" illustration of the old SMX. It seems to clarify a lot of things - the old diagrams looked like someone had just thrown all the units in there. It also shows that the split into subunits isn't really anything new with Maxwell; the principle is the same, it's just that what is shared by 2 subunits and what is not has changed (and that had already changed with gk20x for the TMUs). Does that come from some new NVIDIA marketing material?
 
The only thing that might curb our enthusiasm in this article is the idea that NVIDIA saved power by removing all the inter-GPC interconnect logic (just like in Tegra K1) which won't be possible in bigger Maxwell chips.

But I think most of the savings come from the redesigned SMs and extra L2 cache, so I'm not too worried about scaling.

There's definitely a point behind it, exactly because those interconnects are quite complex beasts; I'd love to stand corrected, but if true, "all" should be valid for K1 only.
 
There's definitely a point behind it, exactly because those interconnects are quite complex beasts; I'd love to stand corrected, but if true, "all" should be valid for K1 only.

K1 can do without any inter-SM logic (and wires) whereas GM107 could only get rid of inter-GPC stuff. But from what I understand, it's the distributed geometry that's tricky, i.e. the inter-GPC part.

I sure would love for NVIDIA to give us more information.
 
I especially like the "new" illustration of the old SMX. It seems to clarify a lot of things - the old diagrams looked like someone had just thrown all the units in there. It also shows that the split into subunits isn't really anything new with Maxwell; the principle is the same, it's just that what is shared by 2 subunits and what is not has changed (and that had already changed with gk20x for the TMUs). Does that come from some new NVIDIA marketing material?

Not really. The marketing material was that big block of 256 (sic!) green squares that was Kepler versus the 4x 32 blocks with individual control that is Maxwell. I guess Damien's very good diagrams come from clever analysis rather than the Press_Preso.pdf copy-and-paste that many sites resorted to.
 
The only thing that might curb our enthusiasm in this article is the idea that NVIDIA saved power by removing all the inter-GPC interconnect logic (just like in Tegra K1) which won't be possible in bigger Maxwell chips.
The same argument would hold for a GK107. I don't remember that one being praised for extraordinary perf/W, and the perf/W bar chart shows a GTX 650 sitting at 62% of the 750 Ti. That's one of the better results among the other Keplers (which sit more around the 56% mark), but not earth-shattering.

But I think most of the savings come from the redesigned SMs and extra L2 cache, so I'm not too worried about scaling.
Yeah, I'm not oblivious to the fact that interconnect has a certain cost, but I don't think it's significant compared to something as intensive as an SM.
 
That came straight from NVIDIA.
Ah, good to know. I guess you spoke to a different person than Carsten did :). All the more impressive, then. Though I believe GK208 was actually the most power-efficient Kepler chip, but it is never included in comparisons since there are almost zero reviews of the useful variant (the GT 640 with 64-bit GDDR5).
 
No HDMI 2.0, no HEVC encode OR decode, optional DisplayPort, and a 128-bit bus make me sad. Low idle power and noise make me happy. :shrug: YMMV. I don't play games, I do video editing, so I'm not a perfect match.

I think EVGA offers HDMI and DisplayPort with their 750 product offerings ... some even have a "bonus 6-pin power input connector" that "provides an additional 25 watts, giving you an increase in power of 35%!"

http://www.guru3d.com/news_story/evga_geforce_gtx_750_and_gtx_750_ti.html
 
Where exactly do you see that "significant advantage"?

LTC Mining

[chart: Litecoin.png]

I suppose it depends on which chip you're comparing to which chip. ExtremeTech tested khash/watt against the R9 270, and the R9 270 won (though just barely):
http://www.extremetech.com/gaming/1...per-efficient-quiet-a-serious-threat-to-amd/3
[chart: Litecoin.png]

[chart: LiteCoinEfficiency.png]


Tom's numbers for the Radeons seem a tad on the low side based on the R9 270 at least, but one has to remember that there are ~10% differences from one card to the next within the same model, and slight adjustments to clocks or voltages can cause huge variations (well beyond 10%) too, for good or bad.
 
What's wrong with which of their diagrams? They look fine to me, and they're certainly more accurate than most of the other stuff fabricated or copied out there.

There are 2 things I find odd with their SMM diagram:

1. L1 capacity per SMM is now less than 24 KB (I say less, since it seems to now be shared with texture data that previously had its own read-only cache). Previously it could be configured as 16, 32 or 48 KB per SMX (IIRC). The reduction seems like it might introduce performance portability problems. I do understand that this chip is not primarily targeted at compute workloads. Maybe L1 is less important for graphics, and this just doesn't matter?

2. They show 2 DP units shared between 2 blocks of 32 SP "cores". This would seem to require more cooperation between warp schedulers than strictly necessary, which the white-paper linked earlier in this thread talks about avoiding. It also turns DP ops into a variable latency thing. Why do that when each scheduler could just be given a single DP unit instead?

Anyway, I'm not saying their diagrams are wrong. Just that I won't believe them until I've seen the tests they used that led them to draw things the way they have. I'm especially interested in how they are determining L1 size and which units are shared.
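
Regarding point 1, for anyone less familiar with it: on Kepler the 64 KB of per-SMX on-chip storage is split between shared memory and L1, and CUDA lets you request that split per kernel. A minimal sketch of that knob follows; the kernel and its name are throwaway examples of mine, nothing Maxwell-specific.

Code:
// Minimal sketch of the Kepler-era split mentioned in point 1: the per-SMX
// 64 KB of on-chip storage is divided between shared memory and L1, and the
// split can be requested per kernel. The kernel itself is just a placeholder.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spillHeavyKernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 0.5f;   // stand-in for code that would benefit from a larger L1
}

int main()
{
    // Ask for the 48 KB L1 / 16 KB shared configuration on sm_3x.
    // cudaFuncCachePreferShared and cudaFuncCachePreferEqual are the other options.
    cudaFuncSetCacheConfig(spillHeavyKernel, cudaFuncCachePreferL1);

    float *out = nullptr;
    cudaMalloc(&out, 256 * sizeof(float));
    spillHeavyKernel<<<1, 256>>>(out);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(out);
    return 0;
}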
 
The primary task of the L1 on Kepler is to act as a cache for register spilling, whilst in CUDA the texture cache serves the purpose of the L1 data cache commonly seen in other compute architectures.

So combining the L1 and texture cache on Maxwell seems a reasonable step to me; also, don't forget that Maxwell significantly increases the L2 cache size and reduces its latency as well.

I am quite excited about the possibility of significant improvements to the bit-wise and integer ops on Maxwell, though we still need more evidence to prove that.
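
To make the first point concrete with a quick sketch of my own (sm_35 and newer; the kernel and all names here are made up for illustration): read-only global loads can be steered through the texture / read-only cache with __ldg(), which is exactly the "texture cache as L1 data cache" role described above.

Code:
// Sketch of the read-only / texture data path (sm_35+; compile with
// -arch=sm_35 or later). __ldg() routes a read-only global load through the
// texture cache instead of the L1, which mostly handles spills and local data.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleReadOnly(const float * __restrict__ in, float *out,
                              float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * __ldg(&in[i]);   // load served by the read-only (texture) cache
}

int main()
{
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    scaleReadOnly<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(in);
    cudaFree(out);
    return 0;
}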
 
I'm especially interested in how they are determining L1 size and which units are shared.
The texture cache is fixed at 12 KB per TMU quad; that's what NV has been using for many generations, and apparently it's the presumed size for Maxwell too, since there's no other new information on the subject.
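
For what it's worth, that assumption lines up with the 24 KB per SMM figure being questioned, if one takes the 8 TMUs per SMM and 16 per SMX from the diagrams:

\[ 4\ \text{quads} \times 12\ \text{KB} = 48\ \text{KB per SMX (16 TMUs)} \qquad \text{vs.} \qquad 2\ \text{quads} \times 12\ \text{KB} = 24\ \text{KB per SMM (8 TMUs)} \]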
 
After looking at some reviews, I think this is a case of a very interesting chip that was put in a mediocre product.


The GM107 does seem to be quite a big step in terms of performance/watt. It seems to be nVidia's path into the notebook market where the GK107/GK208 might become irrelevant as soon as Broadwell comes out.
Moreover, it also seems to be a great step into the computing/mining market, where AMD scores all the points for the moment... though looking at how the ASICs are taking over, this train may already be lost.

If they can ever translate these efficiency gains to the next Tegra, then that's even better. A Tegra M1 with the same performance as the K1 but with half the power consumption might even fit into a smartphone. Plus, the reduced memory bandwidth requirements due to the increased on-chip cache could make for an even better performance upgrade on mobile, where GDDR5 speeds are still prohibitive.

Now the Geforce 750 and 750 Ti are... IMHO mediocre products.
Their performance/€ isn't any better than the competition or even nVidia's former product line.
Sure, the new cards are more power efficient, but even with a 50W difference it's not like the people who buy these cards will be using them 24/7, or even 8 hours a day.
Even for people who want this for an always-on media center, the card will be idling most of the time, and during the idle period most modern cards already use next to nothing.
As for PSUs, how many people will be able to use a 75W graphics card but not a 120W one?

And if nVidia really wanted this to be a media center card, they should've built the reference card as a low-profile model.
 
2. They show 2 DP units shared between 2 blocks of 32 SP "cores". This would seem to require more cooperation between warp schedulers than strictly necessary, which the white-paper linked earlier in this thread talks about avoiding. It also turns DP ops into a variable latency thing. Why do that when each scheduler could just be given a single DP unit instead?
That's a good point indeed; it would seem to make more sense if they aren't shared. And while it might not matter much for this chip (as there's not much data to move), it seems like you wouldn't want to have shared DP units for the HPC chip (but presumably you'd want to retain the same general structure). FWIW, it looks to me like the HPC chip would need to have either a 1:2 or 1:4 DP/SP ratio in any case.
Though as for sharing, technically you wouldn't need to share TMUs either. While it is true that due to pixel quad processing you really need quad TMUs, nothing requires them to deliver 4 filtered outputs per clock - you could easily have 4 quad TMUs requiring 2 clocks to deliver the results. I don't think I've seen such designs in anything but the ultra-slow category though (that is, chips not capable of delivering 4 filtered texels per clock in total), and there's probably a reason for this...
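
Just to put numbers on that half-rate alternative (taking the SMM's two shared full-rate quads as the baseline, which is my assumption about the current layout): the aggregate filtering rate comes out the same either way,

\[ 2\ \text{shared quads} \times \frac{4\ \text{texels}}{1\ \text{clk}} = 8\ \text{texels/clk} \qquad\qquad 4\ \text{private quads} \times \frac{4\ \text{texels}}{2\ \text{clk}} = 8\ \text{texels/clk} \]

so the trade-off would be in area and sharing, not in peak throughput.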
 
After looking at some reviews, I think this is a case of a very interesting chip that was put in a mediocre product.


The GM107 does seem to be quite a big step in terms of performance/watt. It seems to be nVidia's path into the notebook market where the GK107/GK208 might become irrelevant as soon as Broadwell comes out.
Moreover, it also seems to be a great step into the computing/mining market, where AMD scores all the points for the moment... though looking at how the ASICs are taking over, this train may already be lost.

If they can ever translate these efficiency gains to the next Tegra, then that's even better. A Tegra M1 with the same performance as the K1 but with half the power consumption might even fit into a smartphone. Plus, the reduced memory bandwidth requirements due to the increased on-chip cache could make for an even better performance upgrade on mobile, where GDDR5 speeds are still prohibitive.

Now the Geforce 750 and 750 Ti are... IMHO mediocre products.
Their performance/€ isn't any better than the competition or even nVidia's former product line.
Sure, the new cards are more power efficient, but even with a 50W difference it's not like the people who buy these cards will be using them 24/7, or even 8 hours a day.
Even for people who want this for an always-on media center, the card will be idling most of the time, and during the idle period most modern cards already use next to nothing.
As for PSUs, how many people will be able to use a 75W graphics card but not a 120W one?

And if nVidia really wanted this to be a media center card, they should've built the reference card as a low-profile model.

You think it's a mediocre product because you're an elitist and don't care about cards at this level of performance (sorry for phrasing it like that).
It's fine: you can put it in a micro-ATX tower without clogging some of the drive bays and use a 300W PSU, for example.
 
There are 2 things I find odd with their SMM diagram:

1. L1 capacity per SMM is now less than 24 KB (I say less, since it seems to now be shared with texture data that previously had its own read-only cache). Previously it could be configured as 16, 32 or 48 KB per SMX (IIRC). The reduction seems like it might introduce performance portability problems. I do understand that this chip is not primarily targeted at compute workloads. Maybe L1 is less important for graphics, and this just doesn't matter?

2. They show 2 DP units shared between 2 blocks of 32 SP "cores". This would seem to require more cooperation between warp schedulers than strictly necessary, which the white-paper linked earlier in this thread talks about avoiding. It also turns DP ops into a variable latency thing. Why do that when each scheduler could just be given a single DP unit instead?

Anyway, I'm not saying their diagrams are wrong. Just that I won't believe them until I've seen the tests they used that led them to draw things the way they have. I'm especially interested in how they are determining L1 size and which units are shared.

Earlier, L2 per core was almost the same as L1, so L1 misses almost always went to DRAM. Now there is a large L2, so they can cut back on L1 without increasing overall miss rate.
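
As a rough back-of-the-envelope check (using the commonly quoted 256 KB of L2 on GK107 with its 2 SMX, versus 2 MB on GM107 with its 5 SMM):

\[ \frac{256\ \text{KB}}{2\ \text{SMX}} = 128\ \text{KB per SMX} \qquad \text{vs.} \qquad \frac{2048\ \text{KB}}{5\ \text{SMM}} \approx 410\ \text{KB per SMM} \]

so the L2 backing each SM roughly tripled even while the per-SM L1/texture storage shrank.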
 
There are 2 things I find odd with their SMM diagram:

1. L1 capacity per SMM is now less than 24 KB (I say less, since it seems to now be shared with texture data that previously had its own read-only cache). Previously it could be configured as 16, 32 or 48 KB per SMX (IIRC). The reduction seems like it might introduce performance portability problems. I do understand that this chip is not primarily targeted at compute workloads. Maybe L1 is less important for graphics, and this just doesn't matter?

2. They show 2 DP units shared between 2 blocks of 32 SP "cores". This would seem to require more cooperation between warp schedulers than strictly necessary, which the white-paper linked earlier in this thread talks about avoiding. It also turns DP ops into a variable latency thing. Why do that when each scheduler could just be given a single DP unit instead?

Anyway, I'm not saying their diagrams are wrong. Just that I won't believe them until I've seen the tests they used that led them to draw things the way they have. I'm especially interested in how they are determining L1 size and which units are shared.

Obviously I won't claim that all the details in those diagrams are correct. There are still some details Nvidia is not willing to share that are difficult to extract from testing :/ However, compared to the official material, I think they get much closer to reality :)

1. Nvidia confirmed to me it is 12 KiB per block of 4 TMUs. I actually asked them if that wouldn't be a perf issue for some compute tasks, but they feel it shouldn't be a problem for THIS particular GPU. Plus, don't forget that register pressure will be lowered a little bit thanks to the shorter arithmetic pipeline.

2. At first I actually placed a single DP unit in each partition, thinking it would be the obvious path to expanding the DP rate for big Maxwell. However, Nvidia corrected me and said those DP units sit outside the partitions.
 