In addition to just posting a link and telling us to hold our horses, can you give a hint what we should be looking at?
I'm not sure I believe hardware.fr's diagrams on that point. I don't see any justification for their claims in the article, and they've also got the texture cache/L1 size at 24KB per SMM (half the amount per SMX), despite the fact that it is now apparently servicing memory reads/writes from the shader cores. Hopefully they follow up with details on how they came to their conclusions.
I especially like the "new" illustration of the old SMX. It seems to clarify a lot of things - the old diagrams looked like someone just threw all the units in there. It also shows that the split into subunits isn't really anything new at all with Maxwell; the principle is all the same, it's just that what is shared by 2 subunits and what is not has changed (but that already changed with GK20x for the TMUs too). Does that come from some new NVIDIA marketing material?

What's wrong with which of their diagrams? They look fine to me, and certainly more accurate than most of the other stuff fabricated or copied out there.
The only thing that might curb our enthusiasm in this article is the idea that NVIDIA saved power by removing all the inter-GPC interconnect logic (just like in Tegra K1) which won't be possible in bigger Maxwell chips.
But I think most of the savings come from the redesigned SMs and extra L2 cache, so I'm not too worried about scaling.
There's definitely a point behind it, exactly because those interconnects are quite complex beasts; I'd love to stand corrected, but if true, the "all" part should be valid for K1 only.
That came straight from NVIDIA.

Ah, good catch. I missed that.
computerbase.de tried to get that information, but NVIDIA apparently refused to say (http://www.computerbase.de/artikel/grafikkarten/2014/nvidia-geforce-gtx-750-ti-maxwell-im-test/), so I wonder what AnandTech's source is.
"The only thing that might curb our enthusiasm in this article is the idea that NVIDIA saved power by removing all the inter-GPC interconnect logic (just like in Tegra K1) which won't be possible in bigger Maxwell chips."

The same argument would hold for a GK107. I don't remember that one being praised for extraordinary perf/W, and that bar chart with perf/W shows a GTX 650 sitting at 62% of the 750 Ti. One of the better showings among the other Keplers (which sit more around the 56% mark), but not earth-shattering.
Yeah, I'm not oblivious to the fact that interconnect has a certain cost, but I don't think it's significant compared to something as intensive as an SM. But I think most of the savings come from the redesigned SMs and extra L2 cache, so I'm not too worried about scaling.
"That came straight from NVIDIA."

Ah, good to know. I guess you spoke to a different person than Carsten did. All the more impressive then. That said, I believe GK208 was actually the most power-efficient Kepler chip, but it is never included anywhere in comparisons since there are almost zero reviews of the useful variant (the GT 640 with 64-bit GDDR5).
No HDMI 2.0, no HEVC encode OR decode, optional DisplayPort, and a 128-bit bus made me sad. Low idle power and noise make me happy. :shrug: YMMV. I don't play games, I do video editing, so I'm not a perfect match.
Where exactly do you see that "significant advantage"?
LTC Mining
"I'm especially interested in how they are determining L1 size and which units are shared."

The texture cache is fixed at 12 KB per TMU quad; that's what NV has been using for many generations, and apparently this is the presumed size for Maxwell, since there's no other new information on the subject.
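For what it's worth, the usual way these cache sizes get estimated is a pointer-chasing microbenchmark: chase dependent loads through the read-only/texture path while growing the footprint, and watch where the average latency jumps. Below is a minimal sketch of that general approach; it's my assumption about the methodology, not hardware.fr's actual test code, and the 128-byte stride and the 4-64 KB sweep are arbitrary choices.

```cuda
// Sketch of a cache-size probe: a single thread chases a dependent load chain
// through the read-only/texture path (__ldg) over footprints of increasing size.
// Average cycles per load should step up once the footprint no longer fits.
// Build for sm_35 or newer (e.g. -arch=sm_50 for GM107); __ldg requires it.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void chase(const int* __restrict__ arr, int steps,
                      long long* out_cycles, int* sink)
{
    int idx = 0;
    long long start = clock64();
    for (int i = 0; i < steps; ++i)
        idx = __ldg(&arr[idx]);          // dependent loads, one per iteration
    long long stop = clock64();
    *out_cycles = stop - start;
    *sink = idx;                         // keep the chain from being optimized away
}

int main()
{
    const int stride = 32;               // 32 ints = 128 bytes, one element per cache line
    const int steps  = 1 << 20;

    for (int kb = 4; kb <= 64; kb *= 2) {   // sweep footprints around the suspected size
        int n = kb * 1024 / (int)sizeof(int);
        int* h = (int*)malloc(n * sizeof(int));
        for (int i = 0; i < n; ++i)
            h[i] = (i + stride) % n;        // circular pointer chain with a fixed stride

        int *d, *d_sink;
        long long* d_cycles;
        cudaMalloc(&d, n * sizeof(int));
        cudaMalloc(&d_sink, sizeof(int));
        cudaMalloc(&d_cycles, sizeof(long long));
        cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

        chase<<<1, 1>>>(d, steps, d_cycles, d_sink);

        long long cycles = 0;
        cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
        printf("%3d KB footprint: %.1f cycles/load\n", kb, (double)cycles / steps);

        cudaFree(d); cudaFree(d_sink); cudaFree(d_cycles); free(h);
    }
    return 0;
}
```

The point where the cycles/load figure steps up should tell you whether you're looking at 12 KB per quad, 24 KB per SMM, or something else entirely.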
"2. They show 2 DP units shared between 2 blocks of 32 SP "cores". This would seem to require more cooperation between warp schedulers than strictly necessary, which the white-paper linked earlier in this thread talks about avoiding. It also turns DP ops into a variable latency thing. Why do that when each scheduler could just be given a single DP unit instead?"

That's a good point indeed; it would seem to make more sense if they aren't shared. And while it might not matter much for this chip (as there's not much data to move), it seems like you wouldn't want to have shared DP units for the HPC chip (but presumably you'd want to retain the same general structure). FWIW, it looks to me like the HPC chip would need to have either a 1:2 or 1:4 DP/SP ratio in any case.
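Just to spell the ratio arithmetic out, here's a trivial back-of-the-envelope calculation; the per-SMM unit counts are simply the figures from the diagram as discussed above, and the HPC numbers are hypothetical.

```cuda
// Plain host-side arithmetic for the DP:SP ratios under discussion.
// Unit counts per SMM are taken from the diagram as described in the thread,
// not from any confirmed spec.
#include <cstdio>

int main()
{
    const int sp_per_smm = 4 * 32;   // 4 blocks of 32 SP "cores" per SMM
    const int dp_per_smm = 2 * 2;    // 2 DP units per pair of blocks, as drawn -> 4 per SMM

    printf("Implied GM107 DP:SP rate = 1:%d\n", sp_per_smm / dp_per_smm);   // 1:32

    // What a hypothetical HPC Maxwell would need per SMM for the usual ratios:
    printf("1:2 would need %d DP units per SMM\n", sp_per_smm / 2);         // 64
    printf("1:4 would need %d DP units per SMM\n", sp_per_smm / 4);         // 32
    return 0;
}
```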
After looking at some reviews, I think this is a case of a very interesting chip that was put in a mediocre product.
The GM107 does seem to be quite a big step in terms of performance/watt. It seems to be nVidia's path into the notebook market where the GK107/GK208 might become irrelevant as soon as Broadwell comes out.
Moreover, it also seems to be a great step into the computing/mining market, where AMD scores all the points for the moment... though looking at how the ASICs are taking over, this train may already be lost.
If they can ever translate these efficiency gains to the next Tegra, then that's even better. A Tegra M1 with the same performance as the K1 but with half the power consumption might even fit into a smartphone. Plus, the reduced memory bandwidth requirements due to the increased on-chip cache could make for an even better performance upgrade on mobile, where GDDR5 speeds are still prohibitive.
Now the Geforce 750 and 750 Ti are... IMHO mediocre products.
Their performance/€ isn't any better than the competition's, or even nVidia's own previous product line.
Sure, the new cards are more power efficient, but even with a 50W difference it's not like the people who buy these cards will be using them 24/7, or even 8 hours a day (some rough numbers on that below).
Even for people who want this for an always-on media center, the card will be idling most of the time, and during the idle period most modern cards already use next to nothing.
As for PSUs, how many people will be able to use a 75W graphics card but not a 120W one?
And if nVidia really wanted this to be a media center card, they should've built the reference card as a low-profile model.
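To put the 50W point above into numbers, this is the kind of back-of-the-envelope calculation I have in mind; the usage hours and electricity price are made-up example values, so plug in your own.

```cuda
// Rough yearly cost of a 50 W power difference (plain host code).
// hours_per_day and price_per_kwh are assumptions, not measurements.
#include <cstdio>

int main()
{
    const double watts_saved   = 50.0;   // the ~50 W delta mentioned above
    const double hours_per_day = 4.0;    // assumed load hours per day
    const double price_per_kwh = 0.20;   // assumed electricity price in EUR/kWh

    double kwh_per_year  = watts_saved / 1000.0 * hours_per_day * 365.0;
    double euro_per_year = kwh_per_year * price_per_kwh;

    printf("%.0f kWh/year -> about %.2f EUR/year\n", kwh_per_year, euro_per_year);
    // Roughly 73 kWh and ~15 EUR per year with these assumptions, which is why
    // the efficiency gain alone doesn't change the value proposition much here.
    return 0;
}
```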
There are 2 things I find odd with their SMM diagram:
1. L1 capacity per SMM is now less than 24 KB (I say less, since it seems to now be shared with texture data that previously had its own read-only cache). Previously it could be configured as 16, 32 or 48 KB per SMX (IIRC) - there's a quick sketch of that knob after this post. The reduction seems like it might introduce performance portability problems. I do understand that this chip is not primarily targeted at compute workloads. Maybe L1 is less important for graphics, and this just doesn't matter?
2. They show 2 DP units shared between 2 blocks of 32 SP "cores". This would seem to require more cooperation between warp schedulers than strictly necessary, which the white-paper linked earlier in this thread talks about avoiding. It also turns DP ops into a variable latency thing. Why do that when each scheduler could just be given a single DP unit instead?
Anyway, I'm not saying their diagrams are wrong. Just that I won't believe them until I've seen the tests they used that led them to draw things the way they have. I'm especially interested in how they are determining L1 size and which units are shared.
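Regarding the configurable split in item 1: for reference, this is the knob Kepler exposes through the CUDA runtime. A minimal sketch only; the 48/16, 16/48 and 32/32 splits are what the Kepler documentation describes as I recall it, and whether (or how) any of this maps onto GM107's unified texture/L1 is exactly the open question.

```cuda
// How the L1-vs-shared-memory split is requested on Kepler (real runtime API
// calls; the resulting split sizes are what Kepler offered, per NVIDIA's docs).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() {}   // placeholder kernel, just something to configure

int main()
{
    // Device-wide preference: on Kepler this asks for 48 KB L1 / 16 KB shared.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    // Per-kernel preference: 16 KB L1 / 48 KB shared for this kernel only.
    cudaFuncSetCacheConfig(dummy, cudaFuncCachePreferShared);

    // The 32/32 split that Kepler added on top of Fermi's two options.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferEqual);

    dummy<<<1, 1>>>();
    cudaDeviceSynchronize();
    puts("cache-config preferences set (the driver is free to ignore them)");
    return 0;
}
```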